AntonelliLab / seqcap_processor

Bioinformatic pipeline for processing Sequence Capture data for Phylogenetics
MIT License

Example file format for secapr clean_reads #36

Closed nbat64 closed 1 year ago

nbat64 commented 1 year ago

Hello,

I am testing the secapr pipeline. However, I have an issue right at the beginning, when cleaning the reads with fastp. My job ends without producing any output, and the only error message concerns the argument for the --read_min flag.

I have installed the pipeline in a mamba env, and my inputs are raw reads in fastq.gz format. This is the command I ran:

secapr clean_reads --input $folder/raw/fastq/ --output $folder/cleaned/ --sample_annotation_file test.txt

Do you have an example for the sample_annotation_file?

Thank you in advance for the help.

Regards

mlaize commented 1 year ago

Hello, there is a real-data example here: http://htmlpreview.github.io/?https://github.com/AntonelliLab/seqcap_processor/blob/master/docs/documentation/tutorial.html which shows how the data should be set up. I would bet that you have to unzip your fastq files first. Also, don't forget to name them xx_R1.fastq and xx_R2.fastq for paired-end data.

Regards,

Mathias

nbat64 commented 1 year ago

Hello @mlaize, yes, but the tutorial is for the previous version of the pipeline, when cleaning was done with Trimmomatic, not fastp, so the sample_annotation_file is different. I tried it like this:

name-something,name-something_R1.fastq.gz
name-something,name-something_R2.fastq.gz

The clean_reads.py script starts with the message:

Genus-species-NB01-107: Counting all reads (forward + reverse) belonging to this sample...
4968081
##################################################
Processing Genus-species-NB01-107...

But it gets stuck at this step: fastp does not seem to produce any output, and there is no error message. I have also run fastp outside of clean_reads, but that then seems to cause a problem for assemble_reads with SPAdes, since I think it looks for the stats produced by clean_reads.py?
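As a sanity check independent of SECAPR, the read count printed in the log above can be reproduced by counting the records in the forward and reverse files directly. The following is a minimal sketch, not the code clean_reads.py uses: it assumes standard four-line FASTQ records, and the function names `count_reads` and `count_sample_reads` are hypothetical.

```python
import gzip
from pathlib import Path


def count_reads(fastq_path):
    """Count records in a FASTQ file, assuming four lines per record.

    Files ending in .gz are read through gzip; anything else is read as plain text.
    """
    path = Path(fastq_path)
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    return n_lines // 4


def count_sample_reads(r1_path, r2_path):
    """Return the forward + reverse read total for one paired-end sample."""
    return count_reads(r1_path) + count_reads(r2_path)
```

If the total for a sample's R1 and R2 files agrees with the number printed by clean_reads, the input files were at least found and read completely, which would point to the fastp step rather than the inputs.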

Thanks, regards

tandermann commented 1 year ago

@nbat64, which version of SECAPR are you running (what does secapr -v give you as output)? You are correct that with the latest version that implements fastp for cleaning and trimming it is not necessary anymore to unzip the fastq-files. I haven't had time in several months to update the pipeline, so it is possible that there are some bugs. The last time I ran it, my adapter.txt file looked like this:

[adapters]
i7:GATCGGAAGAGCACACGTCTGAACTCCAGTCAC*ATCTCGTATGCCGTCTTCTGCTTG
i5:AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT

[names]
T_pyra1:_
T_pyra3:_
T_pella5:_
T_pella9:_

[barcodes]
i7-T_pyra1:ATTGAGGA
i7-T_pyra3:ATGCCTAA
i7-T_pella5:GTCTGTCA
i7-T_pella9:ACATTGGC

My fastq-samples were in one folder and were named like this:

T_pella5_R1.fastq
T_pella5_R2.fastq
T_pella9_R1.fastq
T_pella9_R2.fastq
T_pyra1_R1.fastq
T_pyra1_R2.fastq
T_pyra3_R1.fastq
T_pyra3_R2.fastq

I ran this command to clean:

secapr clean_reads --input pipeline_exercise/fastq_raw/ --config pipeline_exercise/adapter_info.txt --output pipeline_exercise/cleaned_trimmed_reads --index single

Let me know if that helps, I'll be more responsive now that I have some more time to work on SECAPR.

tandermann commented 1 year ago

I realize now that those were unzipped fastq files, but in theory it should also work for zipped ones (let me know if it doesn't).

tandermann commented 1 year ago

Ignore the things I wrote above, that was for the old version. In the latest development version, secapr clean_reads takes a text file under the --sample_annotation_file flag that looks like this:

T_pella5,RAPiD-Genomics_F226_GOT_130407_P001_WA01_i5-539_i7-59_S1986
T_pella9,RAPiD-Genomics_F226_GOT_130407_P001_WA02_i5-539_i7-27_S1987
T_pyra1,RAPiD-Genomics_F226_GOT_130407_P001_WA03_i5-539_i7-82_S1988
T_pyra3,RAPiD-Genomics_F226_GOT_130407_P001_WA04_i5-539_i7-7_S1989

The term before the comma is the name you want to assign to the given sample for all downstream operations. The term after the comma should be a string in the filename of the raw fastq file (zipped or unzipped) that uniquely identifies the respective sample.
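The rule described above (one `name,identifier` pair per line, with the identifier occurring in the file names of exactly one sample) can be checked before starting a run. The sketch below is only an illustration under stated assumptions: it is not SECAPR's own matching logic, the function names `parse_annotation` and `check_matches` are hypothetical, matching is done by plain substring search, and the default of two files per sample assumes paired-end data with one R1 and one R2 file.

```python
def parse_annotation(text):
    """Parse 'name,identifier' lines into a list of (name, identifier) tuples."""
    pairs = []
    for line_no, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue  # this checker skips blank lines
        fields = [field.strip() for field in line.split(",")]
        if len(fields) != 2 or not all(fields):
            raise ValueError(
                f"line {line_no}: expected 'name,identifier', got {line!r}"
            )
        pairs.append((fields[0], fields[1]))
    return pairs


def check_matches(pairs, filenames, files_per_sample=2):
    """Report every sample whose identifier does not match the expected number of files."""
    problems = []
    for name, ident in pairs:
        hits = sorted(f for f in filenames if ident in f)
        if len(hits) != files_per_sample:
            problems.append(
                f"{name}: identifier {ident!r} matches {len(hits)} file(s), "
                f"expected {files_per_sample}"
            )
    return problems
```

To use it, pass the contents of the annotation file (for example `Path("test.txt").read_text()`) to `parse_annotation` and the result of `os.listdir()` on the input folder to `check_matches`. An empty list means every identifier matched exactly two files; an identifier that is too short, or one that includes the `_R1`/`_R2` part of the file name, shows up as a mismatch.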

For the --input flag I provided a folder with the zipped fastq files, with the filenames looking like this:

RAPiD-Genomics_F226_GOT_130407_P001_WA01_i5-539_i7-59_S1986_L001_R1_001.fastq.gz
RAPiD-Genomics_F226_GOT_130407_P001_WA01_i5-539_i7-59_S1986_L001_R2_001.fastq.gz
RAPiD-Genomics_F226_GOT_130407_P001_WA02_i5-539_i7-27_S1987_L001_R1_001.fastq.gz
RAPiD-Genomics_F226_GOT_130407_P001_WA02_i5-539_i7-27_S1987_L001_R2_001.fastq.gz
RAPiD-Genomics_F226_GOT_130407_P001_WA03_i5-539_i7-82_S1988_L001_R1_001.fastq.gz
RAPiD-Genomics_F226_GOT_130407_P001_WA03_i5-539_i7-82_S1988_L001_R2_001.fastq.gz
RAPiD-Genomics_F226_GOT_130407_P001_WA04_i5-539_i7-7_S1989_L001_R1_001.fastq.gz
RAPiD-Genomics_F226_GOT_130407_P001_WA04_i5-539_i7-7_S1989_L001_R2_001.fastq.gz
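For folders with many samples, the identifier column can be derived from the file names instead of being typed by hand. This is a rough sketch under assumptions, not part of SECAPR: it assumes every file ends in a lane/read suffix of the form `_L001_R1_001.fastq.gz` as in the listing above, and the names `SUFFIX`, `identifiers_from_filenames` and `annotation_lines` are hypothetical. The sample names still have to be supplied by the user, in the same order as the sorted identifiers.

```python
import re

# Assumed suffix pattern, modelled on the file names above: _L001_R1_001.fastq(.gz)
SUFFIX = re.compile(r"_L\d{3}_R[12]_\d{3}\.fastq(?:\.gz)?$")


def identifiers_from_filenames(filenames):
    """Return the sorted, unique prefixes left after removing the lane/read suffix."""
    identifiers = set()
    for name in filenames:
        stripped, n_subs = SUFFIX.subn("", name)
        if n_subs == 1:  # ignore files that do not follow the assumed pattern
            identifiers.add(stripped)
    return sorted(identifiers)


def annotation_lines(sample_names, identifiers):
    """Pair user-chosen sample names with identifiers as 'name,identifier' lines."""
    if len(sample_names) != len(identifiers):
        raise ValueError("need exactly one sample name per identifier")
    return [f"{name},{ident}" for name, ident in zip(sample_names, identifiers)]
```

Writing the returned lines to a text file, one per line, gives a file in the layout shown for --sample_annotation_file. Because the pairing is purely positional, the result should be checked by eye before it is used.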

Let me know in case that doesn't work for you or in case you have any other questions.