faircloth-lab / phyluce

software for UCE (and general) phylogenomics
http://phyluce.readthedocs.org/

Illumiprocessor config file issues #306

Open NYX-PLUTO opened 1 year ago

NYX-PLUTO commented 1 year ago

Hello,

I have been dealing with a naming error that is similar to those above. The suggested solutions do not resolve this issue.

I have 40*2 = 80 total fastq.gz files located in the directory "working" that follow this structure:

S703_L003_R1_001.fastq.gz
S703_L003_R2_001.fastq.gz

My configuration file is structured as:

[adapters]
i7:GATCGGAAGAGCACACGTCTGAACTCCAGTCAC*ATCTCGTATGCCGTCTTCTGCTTG
i5:AATGATACGGCGACCACCGAGATCTACAC*ACACTCTTTCCCTACACGACGCTCTTCCGATCT

[tag sequences]
i7-128:TTCGAAGC
i5-534:CGACGTTA

[tag map]
S703:i7-128,i5-534

[names]
S703:BME101020_Atorridus_KernCo_Caliente

My .sh file:

illumiprocessor \
    --input working \
    --output clean-fastq \
    --config illumiprocessor_rev.conf \
    --cores 20 \
    --r1-pattern "{}R1_\d+.fastq.gz" \
    --r2-pattern "{}R2_\d+.fastq.gz"

(I have tried with {}_R1_\d+.fastq.gz and without the r1/r2 pattern flags as well)

The exact error I get:

File "/home/hays/miniconda3/envs/phyluce-1.7.1/lib/python3.6/site-packages/illumiprocessor/core.py", line 106, in _get_read_data
    "errors in your conf file.".format(self.start_name)
OSError: There is a problem with the read names for S703. Ensure you do not have spelling/capitalization errors in your conf file.

Thank you for your help.

rachel-weinberg commented 1 year ago

Hello,

I ran into this same problem, and what fixed it for me was including "L003" in the read pattern flags. So the command that actually ran for me was:

illumiprocessor \
    --input rs_fastq \
    --output rs_clean \
    --config illumiprocessor_rs.conf \
    --cores 8 \
    --r1-pattern "{}_L003_R1_\d+.fastq.gz" \
    --r2-pattern "{}_L003_R2_\d+.fastq.gz"

It also seems like any shared prefixes followed by an underscore in the sample names are sufficient to cause an error (this doesn't appear to be part of your issue, but I thought I would mention it in case anyone else is struggling with this). So, having samples named, for example, "RBW87_A" and "RBW87_B" caused an error regarding duplicate names (I don't have the specific error message because I resolved it already), but renaming samples as "RBW87A" and "RBW87B" seemed to fix the issue.

brantfaircloth commented 1 year ago

If your files follow the structure you've outlined at the top of your email, you will need something like:

--r1-pattern "{}_L\d+_R1_\d+.fastq.gz" \
--r2-pattern "{}_L\d+_R2_\d+.fastq.gz"

What Rachel suggests will also work (but won't work if you have L003 along with L004, L005, etc. in your read names).

-b
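
For anyone who wants to check a pattern before running illumiprocessor, here is a rough sketch of how the patterns discussed in this thread behave against the file names from the original post. It assumes that `{}` is replaced by the sample name from the conf file and that the rest of the pattern is treated as a regular expression matched from the start of the file name; that is an assumption about the matching, not a copy of illumiprocessor's code. The helper `matching_files` and the L004 file name are made up for illustration.

```python
import re


def matching_files(sample, pattern, filenames):
    """Return the file names matched by a read pattern for one sample.

    Assumption: "{}" stands for the sample name and the remainder of the
    pattern is a regular expression anchored at the start of the file name.
    """
    regex = re.compile(pattern.replace("{}", re.escape(sample)))
    return [name for name in filenames if regex.match(name)]


files = [
    "S703_L003_R1_001.fastq.gz",
    "S703_L003_R2_001.fastq.gz",
    "S703_L004_R1_001.fastq.gz",  # hypothetical second lane, not in the original post
]

r1_patterns = [
    r"{}R1_\d+.fastq.gz",         # from the original .sh file
    r"{}_R1_\d+.fastq.gz",        # the other variant that was tried
    r"{}_L003_R1_\d+.fastq.gz",   # lane hard-coded
    r"{}_L\d+_R1_\d+.fastq.gz",   # any lane number
]

for pattern in r1_patterns:
    print(pattern, "->", matching_files("S703", pattern, files))
```

Under that assumption, the first two patterns match nothing because they skip the `_L003` part of the name, the hard-coded `L003` pattern finds only the L003 read, and the `L\d+` pattern finds R1 files from every lane.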