faircloth-lab / phyluce

software for UCE (and general) phylogenomics
http://phyluce.readthedocs.org/

Illumiprocessor config file issues #306

Open NYX-PLUTO opened 12 months ago

NYX-PLUTO commented 12 months ago

Hello,

I have been dealing with a naming error that is similar to those above. The suggested solutions do not resolve this issue.

I have 40 × 2 = 80 fastq.gz files in the directory "working", named like this:

S703_L003_R1_001.fastq.gz
S703_L003_R2_001.fastq.gz

My configuration file is structured as:

[adapters]
i7:GATCGGAAGAGCACACGTCTGAACTCCAGTCAC*ATCTCGTATGCCGTCTTCTGCTTG
i5:AATGATACGGCGACCACCGAGATCTACAC*ACACTCTTTCCCTACACGACGCTCTTCCGATCT

[tag sequences]
i7-128:TTCGAAGC
i5-534:CGACGTTA

[tag map]
S703:i7-128,i5-534

[names]
S703:BME101020_Atorridus_KernCo_Caliente

My .sh file:

illumiprocessor \
    --input working \
    --output clean-fastq \
    --config illumiprocessor_rev.conf \
    --cores 20 \
    --r1-pattern "{}R1_\d+.fastq.gz" \
    --r2-pattern "{}R2_\d+.fastq.gz"

(I have tried with {}_R1_\d+.fastq.gz and without the r1/r2 pattern flags as well)

The exact error I get:

File "/home/hays/miniconda3/envs/phyluce-1.7.1/lib/python3.6/site-packages/illumiprocessor/core.py", line 106, in _get_read_data
    "errors in your conf file.".format(self.start_name)
OSError: There is a problem with the read names for S703. Ensure you do not have spelling/capitalization errors in your conf file.

Thank you for your help.

rachel-weinberg commented 12 months ago

Hello,

I ran into this same problem, and what fixed it for me was including "L003" in the read pattern flags. So the command that actually ran for me was:

illumiprocessor \
    --input rs_fastq \
    --output rs_clean \
    --config illumiprocessor_rs.conf \
    --cores 8 \
    --r1-pattern "{}_L003_R1_\d+.fastq.gz" \
    --r2-pattern "{}_L003_R2_\d+.fastq.gz"

It also seems that sample names sharing a prefix followed by an underscore are enough to cause an error. This doesn't appear to be part of your issue, but I thought I would mention it in case anyone else is struggling with it. For example, having samples named "RBW87_A" and "RBW87_B" caused an error about duplicate names (I don't have the specific error message because I had already resolved it), but renaming the samples to "RBW87A" and "RBW87B" seemed to fix the issue.
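For anyone who wants to check a sample list for this situation before running, here is a minimal sketch. It only groups names by the text before the first underscore and reports groups with more than one member. `shared_prefix_groups` is a hypothetical helper, not part of illumiprocessor, and it does not reproduce whatever check illumiprocessor runs internally.

```python
from collections import defaultdict


def shared_prefix_groups(names):
    """Group sample names that share the text before their first underscore.

    Hypothetical helper for illustration only; names without an
    underscore are ignored.
    """
    groups = defaultdict(list)
    for name in names:
        prefix, sep, _rest = name.partition("_")
        if sep:  # the name contains at least one underscore
            groups[prefix].append(name)
    # Keep only prefixes shared by two or more names.
    return {prefix: members for prefix, members in groups.items() if len(members) > 1}


if __name__ == "__main__":
    print(shared_prefix_groups(["RBW87_A", "RBW87_B", "S703"]))
    print(shared_prefix_groups(["RBW87A", "RBW87B", "S703"]))
```

An empty result for the renamed samples is consistent with the renaming described above, although that alone does not confirm the cause of the duplicate-name error.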

brantfaircloth commented 11 months ago

If your files follow the structure you've outlined at the top of your email, you will need something like:

--r1-pattern "{}_L\d+_R1_\d+.fastq.gz" \
--r2-pattern "{}_L\d+_R2_\d+.fastq.gz"

What Rachel suggests will also work (but won't work if you have L003 along with L004, L005, etc. in your read names).

-b
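To see why the patterns in this thread behave differently, here is a rough sketch in Python. It assumes that "{}" is replaced by the sample prefix from the [tag map] section and that the result is matched against each file name as a regular expression. That is how the patterns above read, but this is not illumiprocessor's actual code. `matching_files` and the L004 file name are made up for the illustration.

```python
import re

FILES = [
    "S703_L003_R1_001.fastq.gz",
    "S703_L003_R2_001.fastq.gz",
    "S703_L004_R1_001.fastq.gz",  # hypothetical second lane, not in the original post
]


def matching_files(template, sample, files):
    """Return the files matched once "{}" in template is filled with sample.

    Assumption: the filled-in template is treated as a regular expression
    that must match the whole file name.
    """
    regex = re.compile(template.replace("{}", re.escape(sample)))
    return [name for name in files if regex.fullmatch(name)]


if __name__ == "__main__":
    templates = [
        r"{}R1_\d+.fastq.gz",         # original attempt: no lane, no underscore
        r"{}_R1_\d+.fastq.gz",        # second attempt: still skips the lane field
        r"{}_L003_R1_\d+.fastq.gz",   # works only for lane L003
        r"{}_L\d+_R1_\d+.fastq.gz",   # works for any lane number
    ]
    for template in templates:
        print(template, "->", matching_files(template, "S703", FILES))
```

Under these assumptions, the first two patterns match nothing because they skip the "_L003" field, the hard-coded L003 pattern misses the L004 file, and the "L\d+" pattern picks up both lanes.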