faircloth-lab / phyluce

software for UCE (and general) phylogenomics
http://phyluce.readthedocs.org/

Naming issue - Illumiprocessor config file #268

Closed sallesmath closed 2 years ago

sallesmath commented 2 years ago

Hello!

This has been discussed previously and I've read those posts, but I'm still getting the following error:

illumiprocessor --input raw-fastq --output clean-fastq --config new_illumiprocessor.conf --cores 4 --r1-pattern _R1 --r2-pattern _R2
2022-02-15 15:46:25,203 - illumiprocessor - INFO - ==================== Starting illumiprocessor ===================
2022-02-15 15:46:25,203 - illumiprocessor - INFO - Version: 2.10
2022-02-15 15:46:25,203 - illumiprocessor - INFO - Argument --config: new_illumiprocessor.conf
2022-02-15 15:46:25,203 - illumiprocessor - INFO - Argument --cores: 4
2022-02-15 15:46:25,203 - illumiprocessor - INFO - Argument --input: /home/sallesmath/dados_uce/raw-fastq
2022-02-15 15:46:25,203 - illumiprocessor - INFO - Argument --log_path: None
2022-02-15 15:46:25,204 - illumiprocessor - INFO - Argument --min_len: 40
2022-02-15 15:46:25,204 - illumiprocessor - INFO - Argument --no_merge: False
2022-02-15 15:46:25,204 - illumiprocessor - INFO - Argument --output: /home/sallesmath/dados_uce/clean-fastq
2022-02-15 15:46:25,204 - illumiprocessor - INFO - Argument --phred: phred33
2022-02-15 15:46:25,204 - illumiprocessor - INFO - Argument --r1_pattern: _R1
2022-02-15 15:46:25,204 - illumiprocessor - INFO - Argument --r2_pattern: _R2
2022-02-15 15:46:25,204 - illumiprocessor - INFO - Argument --se: False
2022-02-15 15:46:25,204 - illumiprocessor - INFO - Argument --trimmomatic: /home/sallesmath/miniconda3/envs/phyluce-1.7.1/bin/trimmomatic
2022-02-15 15:46:25,204 - illumiprocessor - INFO - Argument --verbosity: INFO
Traceback (most recent call last):
  File "/home/sallesmath/miniconda3/envs/phyluce-1.7.1/bin/illumiprocessor", line 17, in <module>
    sys.exit(main())
  File "/home/sallesmath/miniconda3/envs/phyluce-1.7.1/lib/python3.6/site-packages/illumiprocessor/cli/main.py", line 114, in main
    main(args)
  File "/home/sallesmath/miniconda3/envs/phyluce-1.7.1/lib/python3.6/site-packages/illumiprocessor/main.py", line 34, in main
    reads.append(core.SequenceData(args, conf, start_name, end_name))
  File "/home/sallesmath/miniconda3/envs/phyluce-1.7.1/lib/python3.6/site-packages/illumiprocessor/core.py", line 85, in __init__
    self._get_read_data()
  File "/home/sallesmath/miniconda3/envs/phyluce-1.7.1/lib/python3.6/site-packages/illumiprocessor/core.py", line 106, in _get_read_data
    "errors in your conf file.".format(self.start_name)
OSError: There is a problem with the read names for RAPiD-Genomics_F176_FUP_141801_P001_WA05. Ensure you do not have spelling/capitalization errors in your conf file.

These are my files:

-rwxrwxrwx 1 sallesmath sallesmath 260229837 Jan 24 18:18 RAPiD-Genomics_F176_FUP_141801_P001_WA05_i5-534_i7-38_S217_L002_R1_001.fastq.gz
-rwxrwxrwx 1 sallesmath sallesmath 314512106 Jan 24 18:33 RAPiD-Genomics_F176_FUP_141801_P001_WA05_i5-534_i7-38_S217_L002_R2_001.fastq.gz
-rwxrwxrwx 1 sallesmath sallesmath 244078973 Jan 24 18:33 RAPiD-Genomics_F176_FUP_141801_P001_WA06_i5-534_i7-74_S218_L002_R1_001.fastq.gz
-rwxrwxrwx 1 sallesmath sallesmath 290782903 Jan 24 19:05 RAPiD-Genomics_F176_FUP_141801_P001_WA06_i5-534_i7-74_S218_L002_R2_001.fastq.gz
-rwxrwxrwx 1 sallesmath sallesmath 243771396 Jan 24 19:11 RAPiD-Genomics_F176_FUP_141801_P001_WA09_i5-534_i7-36_S221_L002_R1_001.fastq.gz
-rwxrwxrwx 1 sallesmath sallesmath 292212818 Jan 24 19:13 RAPiD-Genomics_F176_FUP_141801_P001_WA09_i5-534_i7-36_S221_L002_R2_001.fastq.gz
-rwxrwxrwx 1 sallesmath sallesmath 190273914 Jan 24 19:11 RAPiD-Genomics_F176_FUP_141801_P001_WA10_i5-534_i7-54_S222_L002_R1_001.fastq.gz
-rwxrwxrwx 1 sallesmath sallesmath 228411548 Jan 24 19:18 RAPiD-Genomics_F176_FUP_141801_P001_WA10_i5-534_i7-54_S222_L002_R2_001.fastq.gz
-rwxrwxrwx 1 sallesmath sallesmath 212160650 Jan 24 19:18 RAPiD-Genomics_F176_FUP_141801_P001_WA11_i5-534_i7-25_S223_L002_R1_001.fastq.gz
-rwxrwxrwx 1 sallesmath sallesmath 255392164 Jan 24 19:21 RAPiD-Genomics_F176_FUP_141801_P001_WA11_i5-534_i7-25_S223_L002_R2_001.fastq.gz
-rwxrwxrwx 1 sallesmath sallesmath  95138993 Jan 24 19:21 RAPiD-Genomics_F176_FUP_141801_P001_WA12_i5-534_i7-23_S224_L002_R1_001.fastq.gz
-rwxrwxrwx 1 sallesmath sallesmath 116223869 Jan 24 19:22 RAPiD-Genomics_F176_FUP_141801_P001_WA12_i5-534_i7-23_S224_L002_R2_001.fastq.gz
-rwxrwxrwx 1 sallesmath sallesmath 312703366 Feb  2 16:29 RAPiD-Genomics_F176_FUP_141801_P001_WB01_i5-534_i7-31_S225_L002_R1_001.fastq.gz
-rwxrwxrwx 1 sallesmath sallesmath 372020696 Feb  2 16:33 RAPiD-Genomics_F176_FUP_141801_P001_WB01_i5-534_i7-31_S225_L002_R2_001.fastq.gz
-rwxrwxrwx 1 sallesmath sallesmath 239417910 Feb  2 16:37 RAPiD-Genomics_F176_FUP_141801_P001_WB02_i5-534_i7-39_S226_L002_R1_001.fastq.gz
-rwxrwxrwx 1 sallesmath sallesmath 282782992 Feb  2 16:03 RAPiD-Genomics_F176_FUP_141801_P001_WB02_i5-534_i7-39_S226_L002_R2_001.fastq.gz
-rwxrwxrwx 1 sallesmath sallesmath 227597484 Feb  2 16:00 RAPiD-Genomics_F176_FUP_141801_P001_WB03_i5-534_i7-2_S227_L002_R1_001.fastq.gz
-rwxrwxrwx 1 sallesmath sallesmath 273037339 Feb  2 16:03 RAPiD-Genomics_F176_FUP_141801_P001_WB03_i5-534_i7-2_S227_L002_R2_001.fastq.gz
-rwxrwxrwx 1 sallesmath sallesmath 769251191 Feb  2 16:09 RAPiD-Genomics_F176_FUP_141801_P001_WB05_i5-534_i7-90_S229_L002_R1_001.fastq.gz
-rwxrwxrwx 1 sallesmath sallesmath 921803276 Jan 24 21:58 RAPiD-Genomics_F176_FUP_141801_P001_WB05_i5-534_i7-90_S229_L002_R2_001.fastq.gz
-rwxrwxrwx 1 sallesmath sallesmath 337325296 Jan 26 00:39 RAPiD-Genomics_F176_FUP_141801_P002_WA01_i5-535_i7-59_S308_L002_R1_001.fastq.gz
-rwxrwxrwx 1 sallesmath sallesmath 409804219 Jan 26 00:44 RAPiD-Genomics_F176_FUP_141801_P002_WA01_i5-535_i7-59_S308_L002_R2_001.fastq.gz
-rwxrwxrwx 1 sallesmath sallesmath 208856948 Jan 26 00:46 RAPiD-Genomics_F176_FUP_141801_P002_WA02_i5-535_i7-27_S309_L002_R1_001.fastq.gz
-rwxrwxrwx 1 sallesmath sallesmath 257112483 Jan 26 00:51 RAPiD-Genomics_F176_FUP_141801_P002_WA02_i5-535_i7-27_S309_L002_R2_001.fastq.gz
-rwxrwxrwx 1 sallesmath sallesmath 354261206 Jan 26 00:55 RAPiD-Genomics_F176_FUP_141801_P002_WA03_i5-535_i7-82_S310_L002_R1_001.fastq.gz
-rwxrwxrwx 1 sallesmath sallesmath 436072875 Jan 26 00:59 RAPiD-Genomics_F176_FUP_141801_P002_WA03_i5-535_i7-82_S310_L002_R2_001.fastq.gz
-rwxrwxrwx 1 sallesmath sallesmath 325515387 Jan 26 01:05 RAPiD-Genomics_F176_FUP_141801_P002_WA04_i5-535_i7-7_S311_L002_R1_001.fastq.gz
-rwxrwxrwx 1 sallesmath sallesmath 400096313 Jan 26 01:12 RAPiD-Genomics_F176_FUP_141801_P002_WA04_i5-535_i7-7_S311_L002_R2_001.fastq.gz
-rwxrwxrwx 1 sallesmath sallesmath 425516712 Jan 26 01:17 RAPiD-Genomics_F176_FUP_141801_P002_WA05_i5-535_i7-38_S312_L002_R1_001.fastq.gz
-rwxrwxrwx 1 sallesmath sallesmath 527977136 Jan 26 01:24 RAPiD-Genomics_F176_FUP_141801_P002_WA05_i5-535_i7-38_S312_L002_R2_001.fastq.gz
-rwxrwxrwx 1 sallesmath sallesmath 298956466 Jan 26 01:27 RAPiD-Genomics_F176_FUP_141801_P002_WA06_i5-535_i7-74_S313_L002_R1_001.fastq.gz
-rwxrwxrwx 1 sallesmath sallesmath 376309875 Jan 26 01:44 RAPiD-Genomics_F176_FUP_141801_P002_WA06_i5-535_i7-74_S313_L002_R2_001.fastq.gz
-rwxrwxrwx 1 sallesmath sallesmath 677738437 Jan 26 02:12 RAPiD-Genomics_F176_FUP_141801_P002_WB08_i5-535_i7-57_S327_L002_R1_001.fastq.gz
-rwxrwxrwx 1 sallesmath sallesmath 821731675 Jan 26 02:18 RAPiD-Genomics_F176_FUP_141801_P002_WB08_i5-535_i7-57_S327_L002_R2_001.fastq.gz
-rwxrwxrwx 1 sallesmath sallesmath 721307747 Jan 26 03:38 RAPiD-Genomics_F176_FUP_141801_P002_WC09_i5-535_i7-93_S340_L002_R1_001.fastq.gz
-rwxrwxrwx 1 sallesmath sallesmath 884475446 Jan 26 03:50 RAPiD-Genomics_F176_FUP_141801_P002_WC09_i5-535_i7-93_S340_L002_R2_001.fastq.gz

And this is my .config file:

[adapters]
i7:AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC*ATCTCGTATGCCGTCTTCTGCTTG
i5:AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT

[tag sequences]
i5-1:CGACGTTA
i5-2:GACGAATG
i7-82:ATTGGCTC
i7-38:CCGTGAGA
i7-74:ACACGACC
i7-36:CCAGTTCA
i7-54:GGAGAACA
i7-25:AGATCGCA
i7-23:ACTATGCA
i7-31:CAAGACTA
i7-39:CCTCCTGA
i7-2:AAACATCG
i7-90:CGGATTGC
i7-59:GTGTTCTA
i7-27:AGTCACTA
i7-7:CAGATCTG
i7-57:GTCGTAGA
i7-93:GACAGTGC

[tag map]
RAPiD-Genomics_F176_FUP_141801_P001_WA05:i5-534,i7-38
RAPiD-Genomics_F176_FUP_141801_P001_WA06:i5-534,i7-74
RAPiD-Genomics_F176_FUP_141801_P001_WA09:i5-534,i7-36
RAPiD-Genomics_F176_FUP_141801_P001_WA10:i5-534,i7-54
RAPiD-Genomics_F176_FUP_141801_P001_WA11:i5-534,i7-25
RAPiD-Genomics_F176_FUP_141801_P001_WA12:i5-534,i7-23
RAPiD-Genomics_F176_FUP_141801_P001_WB01:i5-534,i7-31
RAPiD-Genomics_F176_FUP_141801_P001_WB02:i5-534,i7-39
RAPiD-Genomics_F176_FUP_141801_P001_WB03:i5-534,i7-2
RAPiD-Genomics_F176_FUP_141801_P001_WB05:i5-534,i7-90
RAPiD-Genomics_F176_FUP_141801_P002_WA01:i5-535,i7-59
RAPiD-Genomics_F176_FUP_141801_P002_WA02:i5-535,i7-27
RAPiD-Genomics_F176_FUP_141801_P002_WA03:i5-535,i7-82
RAPiD-Genomics_F176_FUP_141801_P002_WA04:i5-535,i7-7
RAPiD-Genomics_F176_FUP_141801_P002_WA05:i5-535,i7-38
RAPiD-Genomics_F176_FUP_141801_P002_WA06:i5-535,i7-74
RAPiD-Genomics_F176_FUP_141801_P002_WB08:i5-535,i7-57
RAPiD-Genomics_F176_FUP_141801_P002_WC09:i5-535,i7-93

[names]
RAPiD-Genomics_F176_FUP_141801_P001_WA05:amcc204349
RAPiD-Genomics_F176_FUP_141801_P001_WA06:mtr29592
RAPiD-Genomics_F176_FUP_141801_P001_WA09:chunb74137
RAPiD-Genomics_F176_FUP_141801_P001_WA10:chunb74145
RAPiD-Genomics_F176_FUP_141801_P001_WA11:chunb74146
RAPiD-Genomics_F176_FUP_141801_P001_WA12:chunb74147
RAPiD-Genomics_F176_FUP_141801_P001_WB01:chunb74155
RAPiD-Genomics_F176_FUP_141801_P001_WB02:chunb74156
RAPiD-Genomics_F176_FUP_141801_P001_WB03:chunb74157
RAPiD-Genomics_F176_FUP_141801_P001_WB05:chunb74163
RAPiD-Genomics_F176_FUP_141801_P002_WA01:chunb74162
RAPiD-Genomics_F176_FUP_141801_P002_WA02:amcc204498
RAPiD-Genomics_F176_FUP_141801_P002_WA03:lg15512987
RAPiD-Genomics_F176_FUP_141801_P002_WA04:amcc204357
RAPiD-Genomics_F176_FUP_141801_P002_WA05:mtr29503
RAPiD-Genomics_F176_FUP_141801_P002_WA06:amcc204384
RAPiD-Genomics_F176_FUP_141801_P002_WB08:amcc204510
RAPiD-Genomics_F176_FUP_141801_P002_WC09:mtr29622

I've already checked the configuration file several times, renamed the sequence files, and changed the --r1-pattern and --r2-pattern settings, but nothing worked.

One thing that caught my attention is that my error is an OSError rather than an IOError, as in previous cases. Could this be the cause of my problem? If so, could you tell me why it happens?

Thanks!
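
A quick consistency check of the posted config can be sketched in Python. This is only a sketch under stated assumptions: it parses an abbreviated copy of the config with the standard-library `configparser`, and the helper names `load_conf`, `undefined_tags`, and `files_per_sample` are hypothetical, not part of illumiprocessor. It reports tags used in [tag map] that are not defined in [tag sequences] (the posted config maps samples to i5-534 and i5-535, while [tag sequences] defines i5-1 and i5-2) and lists which files start with each sample name. Whether illumiprocessor itself performs or requires these checks is not established in this thread.

```python
import configparser

# Abbreviated copy of the config posted above (two samples only).
CONF_TEXT = """
[tag sequences]
i5-1:CGACGTTA
i5-2:GACGAATG
i7-38:CCGTGAGA
i7-74:ACACGACC

[tag map]
RAPiD-Genomics_F176_FUP_141801_P001_WA05:i5-534,i7-38
RAPiD-Genomics_F176_FUP_141801_P001_WA06:i5-534,i7-74

[names]
RAPiD-Genomics_F176_FUP_141801_P001_WA05:amcc204349
RAPiD-Genomics_F176_FUP_141801_P001_WA06:mtr29592
"""

# File names copied from the directory listing above.
FILES = [
    "RAPiD-Genomics_F176_FUP_141801_P001_WA05_i5-534_i7-38_S217_L002_R1_001.fastq.gz",
    "RAPiD-Genomics_F176_FUP_141801_P001_WA05_i5-534_i7-38_S217_L002_R2_001.fastq.gz",
    "RAPiD-Genomics_F176_FUP_141801_P001_WA06_i5-534_i7-74_S218_L002_R1_001.fastq.gz",
    "RAPiD-Genomics_F176_FUP_141801_P001_WA06_i5-534_i7-74_S218_L002_R2_001.fastq.gz",
]


def load_conf(text):
    """Parse 'key:value' sections, keeping the case of the keys."""
    parser = configparser.ConfigParser(delimiters=(":",))
    parser.optionxform = str  # configparser lowercases keys by default
    parser.read_string(text)
    return parser


def undefined_tags(parser):
    """Return tags used in [tag map] that are missing from [tag sequences]."""
    defined = set(parser["tag sequences"])
    missing = set()
    for tags in parser["tag map"].values():
        for tag in tags.split(","):
            if tag.strip() not in defined:
                missing.add(tag.strip())
    return sorted(missing)


def files_per_sample(parser, files):
    """Map each name in [names] to the files that start with that name."""
    return {
        name: [f for f in files if f.startswith(name + "_")]
        for name in parser["names"]
    }


conf = load_conf(CONF_TEXT)
print("Tags not defined in [tag sequences]:", undefined_tags(conf))
for name, matched in files_per_sample(conf, FILES).items():
    print(name, "->", len(matched), "file(s)")
```

For paired-end data, each sample name would be expected to map to exactly two files (R1 and R2); any other count suggests a naming mismatch worth investigating.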

brantfaircloth commented 2 years ago

This still seems like an issue with regular expressions (because your names are very long and somewhat non-standard). You should have 2 options:

  1. Change the config file so that the names you use are of the form (do the same in the Tag Map section):
RAPiD-Genomics_F176_FUP_141801_P002_WB08_i5-535_i7-57_S327:amcc204510
RAPiD-Genomics_F176_FUP_141801_P002_WC09_i5-535_i7-93_S340:mtr29622

This should work with the existing R1 and R2 patterns.

  2. Alternatively, you can leave the config file as it is and update the R1 and R2 patterns to something like:
    --r1-pattern "{}_(?:.*)_(?:.*)_(?:.*)_(?:.*)_(R1|READ1|Read1|read1)_\\d+.fastq(?:.gz)*"
    --r2-pattern "{}_(?:.*)_(?:.*)_(?:.*)_(?:.*)_(R2|READ2|Read2|read2)_\\d+.fastq(?:.gz)*"
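
To see why the longer patterns fit these file names, they can be tried outside illumiprocessor with Python's `re` module. The sketch below assumes that the `{}` placeholder is filled with the sample name from the config and that the result is matched against each file name from its start; that is an assumption about how the patterns are applied, not something confirmed in this thread, and the helper `find_reads` is hypothetical. Inside double quotes on the shell command line, `\\d` reaches the program as `\d`, which is what the raw strings below contain.

```python
import re

# The patterns suggested above, written as Python raw strings.
R1_TEMPLATE = r"{}_(?:.*)_(?:.*)_(?:.*)_(?:.*)_(R1|READ1|Read1|read1)_\d+.fastq(?:.gz)*"
R2_TEMPLATE = r"{}_(?:.*)_(?:.*)_(?:.*)_(?:.*)_(R2|READ2|Read2|read2)_\d+.fastq(?:.gz)*"

# File names copied from the directory listing in the original post.
FILES = [
    "RAPiD-Genomics_F176_FUP_141801_P001_WA05_i5-534_i7-38_S217_L002_R1_001.fastq.gz",
    "RAPiD-Genomics_F176_FUP_141801_P001_WA05_i5-534_i7-38_S217_L002_R2_001.fastq.gz",
    "RAPiD-Genomics_F176_FUP_141801_P002_WA05_i5-535_i7-38_S312_L002_R1_001.fastq.gz",
    "RAPiD-Genomics_F176_FUP_141801_P002_WA05_i5-535_i7-38_S312_L002_R2_001.fastq.gz",
]


def find_reads(sample, template, files):
    """Return the files matched when `sample` is substituted into `template`.

    Assumption: plain substitution of the config name into the `{}`
    placeholder, followed by a match anchored at the start of the name.
    """
    pattern = re.compile(template.format(sample))
    return [f for f in files if pattern.match(f)]


sample = "RAPiD-Genomics_F176_FUP_141801_P001_WA05"
print("R1:", find_reads(sample, R1_TEMPLATE, FILES))
print("R2:", find_reads(sample, R2_TEMPLATE, FILES))
```

Each `_(?:.*)` group absorbs one of the extra underscore-separated fields that sit between the short config name and the read designator (here `i5-534`, `i7-38`, `S217`, and `L002`), so each template should select exactly one file per sample.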

sallesmath commented 2 years ago

Thank you very much, Brant! I managed to solve it.

Apparently I had two problems:

1) Illumiprocessor was not working inside phyluce for me. What I did was install illumiprocessor directly through Anaconda in a Python 2.7 virtual environment.
2) The filenames were really long, and illumiprocessor was probably having difficulty handling that kind of complexity. So I: i) renamed the raw_data files to simpler names; ii) ran illumiprocessor on subsets of sequences (~6 samples at a time), as suggested in previous posts; iii) used regular expressions similar to the ones you suggested.

Thanks again!
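
Splitting the samples into subsets of about six can be sketched as follows. This is only an illustration: the thread does not say how the subsets were actually made, and the helper `batches` is hypothetical. The sample IDs are rebuilt from the [names] section of the posted config.

```python
PREFIX = "RAPiD-Genomics_F176_FUP_141801_"

# Plate/well suffixes taken from the [names] section of the posted config.
SUFFIXES = [
    "P001_WA05", "P001_WA06", "P001_WA09", "P001_WA10", "P001_WA11",
    "P001_WA12", "P001_WB01", "P001_WB02", "P001_WB03", "P001_WB05",
    "P002_WA01", "P002_WA02", "P002_WA03", "P002_WA04", "P002_WA05",
    "P002_WA06", "P002_WB08", "P002_WC09",
]
SAMPLES = [PREFIX + suffix for suffix in SUFFIXES]


def batches(items, size):
    """Split `items` into consecutive lists of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]


for number, batch in enumerate(batches(SAMPLES, 6), start=1):
    print("Batch", number)
    for sample in batch:
        print("   ", sample)
```

Each batch could then be given its own config file and input directory so that illumiprocessor is run once per batch; that workflow is an assumption rather than something described above.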

brantfaircloth commented 2 years ago

You are welcome, and I'm glad you got it working!