fmalmeida / ngs-preprocess

A pipeline for preprocessing NGS data from Illumina, Nanopore and PacBio technologies
https://ngs-preprocess.readthedocs.io/
GNU General Public License v3.0

SRA fetch and preprocess of illumina with FASTP hangs #42

Open 0karl0 opened 2 weeks ago

0karl0 commented 2 weeks ago

This command

nextflow run fmalmeida/ngs-preprocess   -r dev -latest -profile docker --sra_ids "./input/sra_ids.txt"   --output illumina_single  --shortreads_type "single"   --fastp_additional_parameters " --trim_front1 5 --trim_tail1 5 "

hangs during processing, with the FASTP process stuck at 0 of 1:

[54/9c682c] process > SRA_FETCH:GET_FASTQ (SRR28776895)    [100%] 1 of 1 ✔
[32/48d60d] process > SRA_FETCH:GET_METADATA (SRR28776895) [100%] 1 of 1 ✔
[-        ] process > NANOPORE:PORECHOP                    -
[-        ] process > NANOPORE:FILTER                      -
[-        ] process > NANOPORE:NANOPACK                    -
[-        ] process > PACBIO:BAM2FASTQ                     -
[-        ] process > PACBIO:NANOPACK                      -
[-        ] process > PACBIO:FILTER                        -
[17/78fe41] process > ILLUMINA:FASTP (SRR28776895)         [  0%] 0 of 1

However, three FASTQ files are produced, and following the previous command with this one completes the preprocessing:

nextflow run fmalmeida/ngs-preprocess   -r dev -latest -profile docker   --shortreads "illumina_single/SRA_FETCH/FASTQ/SRR28776895_data/*.fastq.gz" \
   --output illumina_single  --shortreads_type "single"   --fastp_additional_parameters " --trim_front1 5 --trim_tail1 5 " 
executor >  local (3)
[-        ] process > SRA_FETCH:GET_FASTQ            -
[-        ] process > SRA_FETCH:GET_METADATA         -
[-        ] process > NANOPORE:PORECHOP              -
[-        ] process > NANOPORE:FILTER                -
[-        ] process > NANOPORE:NANOPACK              -
[-        ] process > PACBIO:BAM2FASTQ               -
[-        ] process > PACBIO:NANOPACK                -
[-        ] process > PACBIO:FILTER                  -
[64/63cded] process > ILLUMINA:FASTP (SRR28776895_2) [100%] 3 of 3 ✔

My guess is that Nextflow does not automatically point to the downloaded SRA files. Perhaps there's a flag I missed.

fmalmeida commented 2 weeks ago

Hi @0karl0, thanks for flagging this. I am going to investigate it further, but I cannot commit to a deadline. It is good to know that you could get your data with a workaround.

Once I have updates, I will post them on this ticket.

Cheers.

fmalmeida commented 3 days ago

Hi @0karl0, I figured it out. The problem is that this particular study has technical reads. Thus, three files were being downloaded instead of the two that were expected, since the study is paired-end.
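To illustrate the situation, here is a minimal Python sketch that flags runs with more FASTQ files than a paired-end layout would give. It is not taken from the pipeline: the file naming scheme (`<accession>_<index>.fastq.gz`) and the helper names `group_by_accession` and `runs_with_extra_files` are assumptions made for this example only.

```python
import re
from collections import defaultdict

# Assumed naming scheme: <accession>_<read index>.fastq(.gz) or <accession>.fastq(.gz).
FASTQ_NAME = re.compile(r"^(?P<acc>[SED]RR\d+)(?:_(?P<idx>\d+))?\.fastq(?:\.gz)?$")


def group_by_accession(filenames):
    """Map each run accession to the sorted list of its FASTQ file names."""
    groups = defaultdict(list)
    for name in filenames:
        match = FASTQ_NAME.match(name)
        if match:  # names that do not follow the assumed scheme are ignored
            groups[match.group("acc")].append(name)
    return {acc: sorted(files) for acc, files in groups.items()}


def runs_with_extra_files(filenames, expected=2):
    """Return the accessions that have more FASTQ files than `expected`."""
    grouped = group_by_accession(filenames)
    return {acc: files for acc, files in grouped.items() if len(files) > expected}


# Hypothetical file names for the run discussed in this issue.
files = [
    "SRR28776895_1.fastq.gz",
    "SRR28776895_2.fastq.gz",
    "SRR28776895_3.fastq.gz",
]
print(runs_with_extra_files(files))
```

A check like this only reports that a run has an unexpected number of files. It does not say which of the files holds the technical reads, since that depends on how the run was submitted.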

What I can do is either provide a parameter to allow or disallow technical reads and try to fix their processing, or make the pipeline always skip them.
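The two options could look roughly like the sketch below. It assumes the download step calls `fasterq-dump` from sra-tools, which this thread does not confirm. The function `build_fetch_command` and its `include_technical` argument are hypothetical names, not existing pipeline parameters.

```python
def build_fetch_command(accession, outdir, include_technical=False):
    """Build a fasterq-dump argument list (the downloader is an assumption).

    include_technical is a hypothetical switch standing in for the
    proposed pipeline parameter; by default technical reads are skipped.
    """
    cmd = ["fasterq-dump", accession, "--outdir", outdir, "--split-files"]
    # fasterq-dump offers both flags; exactly one of them is passed here.
    cmd.append("--include-technical" if include_technical else "--skip-technical")
    return cmd


# Always-skip behaviour (the default in this sketch).
print(" ".join(build_fetch_command("SRR28776895", "fastq_out")))

# Opt-in behaviour, if a parameter were exposed.
print(" ".join(build_fetch_command("SRR28776895", "fastq_out", include_technical=True)))
```

Making "skip" the default keeps the expected one or two files per run, while an opt-in switch would still require the downstream steps to handle the extra file.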

What do you think would be preferable in this scenario?

fmalmeida commented 3 days ago

I would probably vote for skipping the technical reads entirely, because they are more relevant for single-cell data.

And I doubt people downloading single-cell data would use an automation like this one.

But I would prefer to hear some input first.