MetaSUB-CAMP / camp_short-read-quality-control

incomplete paths to samples in final_reports/samples.csv output file - correct Snakefile #12

Closed katkopera closed 1 year ago

katkopera commented 1 year ago

Hi, one of the outputs of the module is
/path/to/work/dir/short_read_qc/final_reports/samples.csv, which is supposed to be ingested by the next module, e.g. camp_short-read-assembly.

The paths to the forward and reverse reads in this output file (when using tadpole) are incomplete, which causes the next modules to fail, for example camp_short-read-assembly:

MissingInputException in line 51 of /net/ascratch/people/plgkkopera/camp_tests/short-read-assembly/camp_short-read-assembly/workflow/Snakefile:
Missing input files for rule metaspades_assembly:
    output: /net/ascratch/people/plgkkopera/camp_tests/short-read-assembly/short-read-assembly/0_metaspades/2_07052020_S27/scaffolds.fasta

In rule make_config of the Snakefile, you specify the final fastq directory as:

rule make_config:
    [...]
    params: 
        fastq_dir = join(dirs.OUT, '3_error_removal'),
        samples = SAMPLES,
    [...]
        for i in params.samples:
            s = i.split('/')[-1]
     [...]
            dct[s]['illumina_fwd'] = join(params.fastq_dir, s + '_1.fastq.gz')
            dct[s]['illumina_rev'] = join(params.fastq_dir, s + '_2.fastq.gz')
    [...]

but in rule filter_seq_errors_tp the paths are encoded differently:

rule filter_seq_errors_tp:
    [...]
    output:
        fwd = join(dirs.OUT, '3_error_removal', 'tadpole', '{sample}_1.fastq.gz'),
        rev = join(dirs.OUT, '3_error_removal', 'tadpole', '{sample}_2.fastq.gz'),
    [...]

so the 'tadpole' subdirectory is missing from the paths in the samples.csv file.

The same applies to the bayeshammer option. Please fix that :)
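
A minimal sketch of one way make_config could build the paths so that they match the error-removal rule's outputs. The function name `build_sample_paths` and the `tool_subdir` parameter are hypothetical and not part of the module; only the `illumina_fwd`/`illumina_rev` keys, the `3_error_removal` directory, and the 'tadpole' subdirectory come from the Snakefile excerpts above. The subdirectory name used for bayeshammer is not shown here, so it is left as a parameter.

```python
from os.path import join


def build_sample_paths(out_dir, samples, tool_subdir):
    """Return {sample: {'illumina_fwd': ..., 'illumina_rev': ...}}.

    tool_subdir is the subdirectory the error-removal rule writes to
    (e.g. 'tadpole'). This parameter is an assumption for illustration.
    """
    fastq_dir = join(out_dir, '3_error_removal', tool_subdir)
    dct = {}
    for i in samples:
        s = i.split('/')[-1]  # same sample-name extraction as in make_config
        dct[s] = {
            'illumina_fwd': join(fastq_dir, s + '_1.fastq.gz'),
            'illumina_rev': join(fastq_dir, s + '_2.fastq.gz'),
        }
    return dct
```

With this, the paths written to samples.csv would point into the same subdirectory that filter_seq_errors_tp writes to, provided the chosen method is passed through as `tool_subdir`.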

tommyfuu commented 1 year ago

hi, thanks for the comment! can i see your command used to run the assembly module real quick? thanks

katkopera commented 1 year ago

I am not sure if you have fully read the description of the problem I am reporting.

Of course I can define the samples.csv file myself, and I know how to do it, but my point is that the QC module documentation suggests that the output of quality control can be used directly (without editing) in subsequent modules. And I assume that this is what your intention was.

Did you have both the 'tadpole' and 'bayeshammer' methods from the beginning? It looks like the second method was added at a later stage and the Snakefile was not fully updated after the change.

If users are to use the samples.csv file directly (output from QC) then the Snakefile needs to be changed.

tommyfuu commented 1 year ago

oh i see! that makes a lot of sense. i did read the full description, but was hoping that the cause of the problem was somewhat simpler and that would be the best scenario because it would be easy to fix.

was also curious when the last time you pulled the repository was? while i was not the last person to have edited the code, i did see that in order for this pipeline to finish running, the samples.csv file as you mentioned is required to be generated, in case the user needs it for the next step, as seen in line 67 of the Snakefile.

if you wish we could potentially do a really quick zoom call before 4pm today and see if we could troubleshoot, maybe that can help you with the issue and help me understand where the problem came from as well. thanks!

also as a sidenote, tadpole was added later than bayeshammer - if there's some problem running bayeshammer feel free to let me know as well! It is going to be a lot slower than tadpole, that's why we added tadpole, but hopefully a lot more accurate (which also potentially runs the risk of removing more reads and leading to a really small or empty resulting sequence file).

tommyfuu commented 1 year ago

or if you could potentially provide the command you run that could replicate the error you ran into so we can look into it that'll be fabulous as well. thanks!

katkopera commented 1 year ago

The command I ran was:

python /net/ascratch/people/plgkkopera/camp_tests/short-read-assembly/camp_short-read-assembly/workflow/short-read-assembly.py -c 20 -d /net/ascratch/people/plgkkopera/camp_tests/short-read-assembly -s /net/ascratch/people/plgkkopera/camp_tests/short-read-quality-control/short_read_qc/final_reports/samples.csv

I was able to remove the error by manually adding the 'tadpole' subdirectory to the fwd and rev sample paths.
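
For anyone who hits the same error before the fix lands, the manual edit amounts to something like the sketch below. `add_subdir` is a hypothetical helper, not part of the module, and applying it to samples.csv depends on the file's column layout, which is not shown in this thread.

```python
from os.path import basename, dirname, join


def add_subdir(path, subdir):
    """Insert subdir between the directory and the file name of path."""
    return join(dirname(path), subdir, basename(path))


# Example with a made-up work directory and sample name:
fixed = add_subdir('/work/short_read_qc/3_error_removal/sample_1.fastq.gz', 'tadpole')
```

Each forward and reverse path in samples.csv would need this applied once, with the subdirectory matching the error-removal method that was actually run.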

lauren-mak commented 1 year ago

This will be fully fixed in the next version of the module. Thanks!