TORCH-Consortium / MAGMA

A pipeline for comprehensive genomic analyses of Mycobacterium tuberculosis with a focus on clinical decision making as well as research
https://doi.org/10.1371/journal.pcbi.1011648
GNU General Public License v3.0

Pipeline failing with older fastq file #216

Open RuanSpies21 opened 2 months ago

RuanSpies21 commented 2 months ago

Hi there,

I am trying to run the pipeline on some older fastq files (circa 2010s) using the docker profile. The reads for the files are relatively short at ~75bp. Following previous advice from Abhinav, I have created a custom.config file with contents:

profiles {
    bwa_k66 {
        params {
            BWA_MEM {
                arguments = " -k 66"
            }
        }
    }
}

which I specify with the -c argument. So my full command is: nextflow run . -params-file params/params.yaml -profile docker,server,bwa_k66 -c custom.config.
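
As background on why the seed length matters for short reads, here is a rough Python sketch. It assumes that `-k` is BWA-MEM's minimum seed length, so a read shorter than `k` cannot contain a full-length seed. The function names and the `reads_1.fastq.gz` path are hypothetical and not part of MAGMA.

```python
import gzip


def read_lengths(fastq_lines):
    """Yield the sequence length of each record in a 4-line-per-record FASTQ stream."""
    for index, line in enumerate(fastq_lines):
        if index % 4 == 1:  # the second line of every record holds the bases
            yield len(line.strip())


def fraction_long_enough(lengths, min_seed_length):
    """Return the fraction of reads that are at least min_seed_length bases long."""
    lengths = list(lengths)
    if not lengths:
        return 0.0
    long_enough = sum(1 for length in lengths if length >= min_seed_length)
    return long_enough / len(lengths)


def summarize_file(path, seed_lengths=(66, 100)):
    """Report, per candidate seed length, the fraction of reads long enough to seed."""
    with gzip.open(path, "rt") as handle:
        lengths = list(read_lengths(handle))
    return {k: fraction_long_enough(lengths, k) for k in seed_lengths}
```

Under that assumption, a file containing only 75 bp reads passed to `summarize_file("reads_1.fastq.gz")` would give a fraction of 1.0 for a seed length of 66 and 0.0 for 100.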

However, I get the following error:

ERROR ~ Error executing process > 'UTILS_MERGE_COHORT_STATS (joint_name: joint)'

Caused by:
  Process `UTILS_MERGE_COHORT_STATS (joint_name: joint)` terminated with an error exit status (1)

Command executed:

  generate_merged_cohort_stats.py \
      --relabundance_approved_tsv approved_samples.relabundance.tsv \
      --relabundance_rejected_tsv rejected_samples.relabundance.tsv\
      --call_wf_cohort_stats_tsv joint.cohort_stats.tsv\
      --output_file joint.merged_cohort_stats.tsv

Command exit status:
  1

Command output:
  (empty)

Command error:
  Traceback (most recent call last):
    File "/mnt/volume_data/ruan/walker_2013/MAGMA/bin/generate_merged_cohort_stats.py", line 55, in <module>
      df_final_cohort_stats['ALL_THRESHOLDS_MET'] = df_final_cohort_stats['MAPPED_NTM_FRACTION_16S_THRESHOLD_MET'].astype('bool')  & df_final_cohort_stats['COVERAGE_THRESHOLD_MET'].astype('bool')  & df_final_cohort_stats['BREADTH_OF_COVERAGE_THRESHOLD_MET'].astype('bool')  & df_final_cohort_stats['RELABUNDANCE_THRESHOLD_MET'].astype('bool')
    File "/opt/conda/lib/python3.9/site-packages/pandas/core/generic.py", line 6240, in astype
      new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors)
    File "/opt/conda/lib/python3.9/site-packages/pandas/core/internals/managers.py", line 445, in astype
      return self.apply("astype", dtype=dtype, copy=copy, errors=errors)
    File "/opt/conda/lib/python3.9/site-packages/pandas/core/internals/managers.py", line 347, in apply
      applied = getattr(b, f)(**kwargs)
    File "/opt/conda/lib/python3.9/site-packages/pandas/core/internals/blocks.py", line 526, in astype
      new_values = astype_array_safe(values, dtype, copy=copy, errors=errors)
    File "/opt/conda/lib/python3.9/site-packages/pandas/core/dtypes/astype.py", line 299, in astype_array_safe
      new_values = astype_array(values, dtype, copy=copy)
    File "/opt/conda/lib/python3.9/site-packages/pandas/core/dtypes/astype.py", line 227, in astype_array
      values = values.astype(dtype, copy=copy)
    File "/opt/conda/lib/python3.9/site-packages/pandas/core/arrays/masked.py", line 474, in astype
      raise ValueError("cannot convert float NaN to bool")
  ValueError: cannot convert float NaN to bool

Work dir:
  /mnt/volume_data/ruan/walker_2013/MAGMA/work/92/7f8941da70c68790f747afea230770

Tip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named `.command.sh`

 -- Check '.nextflow.log' file for details

I think this is due to the samples returning with 0 coverage (when I check /mnt/volume_data/ruan/walker_2013/MAGMA/magma-results/QC_statistics/per_sample/coverage, all have 0).

Any ideas what could be going on here or any workarounds? ERR038264 is an example fastq

Thanks! Ruan

abhi18av commented 2 months ago

Hi @RuanSpies21 ,

Happy to work on this together, could you please share 5 sample IDs from your dataset?

This way I can test those locally.

RuanSpies21 commented 2 months ago

Thanks @abhi18av!

ERR038276 ERR038277 ERR038278 ERR038279 ERR038280

RuanSpies21 commented 1 month ago

Hi @abhi18av - any thoughts on this yet?

abhi18av commented 1 month ago

Hi @RuanSpies21 ,

Apologies for the late response on this one. I have been able to reproduce this error on my side using the pipeline's default -k 100 for BWA, which completed in 30 seconds per sample.

This was NOT resolved even when I enabled bwa_k66 on my side with these samples, raising the runtime for BWA to roughly 40 seconds per sample.

The following statistics were generated for the individual files

|SAMPLE                   |AVG_INSERT_SIZE|MAPPED_PERCENTAGE|RAW_TOTAL_SEQS|AVERAGE_BASE_QUALITY|MEAN_COVERAGE|SD_COVERAGE|MEDIAN_COVERAGE|MAD_COVERAGE|PCT_EXC_ADAPTER|PCT_EXC_MAPQ|PCT_EXC_DUPE|PCT_EXC_UNPAIRED|PCT_EXC_BASEQ|PCT_EXC_OVERLAP|PCT_EXC_CAPPED|PCT_EXC_TOTAL|PCT_1X  |PCT_5X  |PCT_10X |PCT_30X |PCT_50X |PCT_100X|MAPPED_NTM_FRACTION_16S|MAPPED_NTM_FRACTION_16S_THRESHOLD_MET|COVERAGE_THRESHOLD_MET|BREADTH_OF_COVERAGE_THRESHOLD_MET|ALL_THRESHOLDS_MET|
|-------------------------|---------------|-----------------|--------------|--------------------|-------------|-----------|---------------|------------|---------------|------------|------------|----------------|-------------|---------------|--------------|-------------|--------|--------|--------|--------|--------|--------|-----------------------|-------------------------------------|----------------------|---------------------------------|------------------|
|MAGMA.ERX015472_ERR038276|366.5          |73.97            |17112670      |34.5                |154.672354   |71.416821  |157            |48          |0              |0.09915     |0.154413    |0               |0.02558      |0.001099       |0             |0.280241     |0.972886|0.96603 |0.961128|0.942092|0.916305|0.78371 |0.0                    |1                                    |1                     |1                                |1                 |
|MAGMA.ERX015473_ERR038277|384.5          |77.16            |18091946      |35.3                |176.428354   |71.317561  |185            |46          |0              |0.086099    |0.148593    |0               |0.020496     |0.000459       |0             |0.255647     |0.973516|0.966587|0.963094|0.950857|0.935023|0.859417|0.0                    |1                                    |1                     |1                                |1                 |
|MAGMA.ERX015474_ERR038278|407.9          |77.47            |13464688      |35.3                |134.847332   |58.306765  |142            |38          |0              |0.084936    |0.132313    |0               |0.020973     |0.000473       |0             |0.238694     |0.966827|0.959598|0.954376|0.933156|0.903887|0.750069|0.0                    |1                                    |1                     |1                                |1                 |
|MAGMA.ERX015475_ERR038279|427.5          |76.3             |16200744      |35.2                |155.460953   |61.334029  |165            |38          |0              |0.09051     |0.147023    |0               |0.021057     |0.000692       |0             |0.259282     |0.97162 |0.964273|0.960278|0.946518|0.928075|0.832158|0.0                    |1                                    |1                     |1                                |1                 |
|MAGMA.ERX015476_ERR038280|478.8          |75.39            |18525588      |35.2                |171.901534   |69.736791  |180            |42          |0              |0.096019    |0.158024    |0               |0.020307     |0.000607       |0             |0.274956     |0.973743|0.967281|0.96324 |0.949922|0.934053|0.859584|0.0                    |1                                    |1                     |1                                |1                 |

And I was able to reproduce the type-casting issue in the Python script:

INFO:    Converting SIF file to temporary sandbox...
Traceback (most recent call last):
  File "/home/abhinav/.nextflow/assets/TORCH-Consortium/MAGMA/bin/generate_merged_cohort_stats.py", line 55, in <module>
    df_final_cohort_stats['ALL_THRESHOLDS_MET'] = df_final_cohort_stats['MAPPED_NTM_FRACTION_16S_THRESHOLD_MET'].astype('bool')  & df_final_cohort_stats['COVERAGE_THRESHOLD_MET'].astype('bool')  & df_final_cohort_stats['BREADTH_OF_COVERAGE_THRESHOLD_MET'].astype('bool')  & df_final_cohort_stats['RELABUNDANCE_THRESHOLD_MET'].astype('bool')
  File "/opt/conda/lib/python3.10/site-packages/pandas/core/generic.py", line 6240, in astype
    new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors)
  File "/opt/conda/lib/python3.10/site-packages/pandas/core/internals/managers.py", line 448, in astype
    return self.apply("astype", dtype=dtype, copy=copy, errors=errors)
  File "/opt/conda/lib/python3.10/site-packages/pandas/core/internals/managers.py", line 352, in apply
    applied = getattr(b, f)(**kwargs)
  File "/opt/conda/lib/python3.10/site-packages/pandas/core/internals/blocks.py", line 526, in astype
    new_values = astype_array_safe(values, dtype, copy=copy, errors=errors)
  File "/opt/conda/lib/python3.10/site-packages/pandas/core/dtypes/astype.py", line 299, in astype_array_safe
    new_values = astype_array(values, dtype, copy=copy)
  File "/opt/conda/lib/python3.10/site-packages/pandas/core/dtypes/astype.py", line 227, in astype_array
    values = values.astype(dtype, copy=copy)
  File "/opt/conda/lib/python3.10/site-packages/pandas/core/arrays/masked.py", line 474, in astype
    raise ValueError("cannot convert float NaN to bool")
ValueError: cannot convert float NaN to bool
INFO:    Cleaning up image...

NOTE

I am currently working on a patch to address this issue - thank you for bringing it to my attention!
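
For readers following the traceback, here is a minimal sketch of the failure mode. It is not the actual patch. The error arises when a threshold column holds a missing value at the moment it is cast to bool, and one way to make the intent explicit is to treat a missing flag as "not met" before combining the flags. The sketch uses only the standard library rather than pandas. The column names come from the traceback, while `flag_is_met` and `all_thresholds_met` are hypothetical helpers.

```python
import math

# Column names as they appear in the traceback above.
THRESHOLD_COLUMNS = [
    "MAPPED_NTM_FRACTION_16S_THRESHOLD_MET",
    "COVERAGE_THRESHOLD_MET",
    "BREADTH_OF_COVERAGE_THRESHOLD_MET",
    "RELABUNDANCE_THRESHOLD_MET",
]


def flag_is_met(value):
    """Return True only for a present, truthy flag; missing or NaN counts as not met."""
    if value is None:
        return False
    text = str(value).strip().lower()
    if text in ("true", "false"):
        return text == "true"
    try:
        number = float(text)
    except ValueError:
        return False  # empty string or any other non-numeric text
    # bool(float("nan")) is True in Python, so NaN has to be handled explicitly.
    if math.isnan(number):
        return False
    return number != 0.0


def all_thresholds_met(row):
    """Combine the per-sample flags; a missing column is treated as a failed threshold."""
    return all(flag_is_met(row.get(column)) for column in THRESHOLD_COLUMNS)
```

Whether a missing relative-abundance flag should count as a failure or be reported separately is a design decision for the pipeline; the sketch simply picks the conservative option.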

abhi18av commented 1 month ago

@RuanSpies21, could you please try running the pipeline with the following command? I have now pushed a patch to the master branch.

NOTE: Please replace whatever makes sense in your context, but the main snippet is -r master -latest -resume

nextflow run 'https://github.com/TORCH-Consortium/MAGMA' \
         -profile singularity,bwa_k66 \
         -r master \
         -latest \
         -resume \
         -params-file params.magma.yaml

RuanSpies21 commented 1 month ago

Thank you so much for the help @abhi18av! I'm so sorry, I am not quite getting it right :(

When I run nextflow run 'https://github.com/TORCH-Consortium/MAGMA' -profile docker,server,bwa_k66 -params-file params.yaml -r master -latest -resume I get: Unknown configuration profile: 'bwa_k66'

If I then add -c custom.config with the file mentioned above, I get: ERROR ~ Unknown method invocation `splitJson` on UnixPath type

Seems to be an issue with sample sheet validation? Here is the format of my sample sheet for reference:

Sample,R1,R2
ERR025842,/mnt/volume_data/ruan/walker_2013/ERR025842_1.fastq.gz,/mnt/volume_data/ruan/walker_2013/ERR025842_2.fastq.gz
ERR025843,/mnt/volume_data/ruan/walker_2013/ERR025843_1.fastq.gz,/mnt/volume_data/ruan/walker_2013/ERR025843_2.fastq.gz

I've also attached the nextflow logs in case helpful.

Thanks again for your help - very sorry to keep bothering! nextflow.log

abhi18av commented 1 month ago

Hi @RuanSpies21

test profile

The samplesheet looks fine to me, but let's make sure that the basics are all set.

nextflow run 'https://github.com/TORCH-Consortium/MAGMA' -profile docker,server,test -r hotfix/bwa_k66

This should make use of the test profile, download some samples from the original MAGMA publication, and run them through.

bwa_k66 profile

I have created a new bwa_k66 profile, which you can use without providing a -c custom.config file.

nextflow run 'https://github.com/TORCH-Consortium/MAGMA' -profile docker,server -r hotfix/bwa_k66 --input_samplesheet /path/to/your/samplesheet.csv

> Seems to be an issue with sample sheet validation?

Actually, to me the samplesheet seems valid 🤔

> Thanks again for your help - very sorry to keep bothering!

No worries at all Ruan, this is very helpful. There's no perfect software, but with user feedback and usage, we can keep improving it.

I do thank you for your patience!


If this doesn't work, then perhaps we can meet sometime next week? Here's my academic email abhinavsharma at sun dot ac dot za 📆

RuanSpies21 commented 1 month ago

OK, it looks like it's failing with the same error on the test profile as well.

I ran nextflow run 'https://github.com/TORCH-Consortium/MAGMA' -profile docker,server,test -r hotfix/bwa_k66

Output:

process > SAMPLESHEET_VALIDATION                                                                [  0%] 0 of 1
[-        ] process > VALIDATE_FASTQS_WF:FASTQ_VALIDATOR                                                    -
[-        ] process > VALIDATE_FASTQS_WF:UTILS_FASTQ_COHORT_VALIDATION                                      -
[-        ] process > QUALITY_CHECK_WF:FASTQC                                                               -
[-        ] process > QUALITY_CHECK_WF:NTMPROFILER_PROFILE                                                  -
[-        ] process > QUALITY_CHECK_WF:NTMPROFILER_COLLATE                                                  -
[-        ] process > MAP_WF:BWA_MEM                                                                        -
[-        ] process > CALL_WF:SAMTOOLS_MERGE                                                                -
[-        ] process > CALL_WF:GATK_MARK_DUPLICATES                                                          -
[-        ] process > CALL_WF:SAMTOOLS_INDEX                                                                -
[-        ] process > CALL_WF:GATK_HAPLOTYPE_CALLER                                                         -
[-        ] process > CALL_WF:LOFREQ_CALL__NTM                                                              -
[-        ] process > CALL_WF:LOFREQ_INDELQUAL                                                              -
[-        ] process > CALL_WF:SAMTOOLS_INDEX__LOFREQ                                                        -
[-        ] process > CALL_WF:LOFREQ_CALL                                                                   -
[-        ] process > CALL_WF:LOFREQ_FILTER                                                                 -
[-        ] process > CALL_WF:UTILS_REFORMAT_LOFREQ                                                         -
[-        ] process > CALL_WF:BGZIP__LOFREQ                                                                 -
[-        ] process > CALL_WF:GATK_INDEX_FEATURE_FILE__LOFREQ                                               -
[-        ] process > CALL_WF:SAMTOOLS_STATS                                                                -
[-        ] process > CALL_WF:GATK_COLLECT_WGS_METRICS                                                      -
[-        ] process > CALL_WF:GATK_FLAG_STAT                                                                -
[-        ] process > CALL_WF:UTILS_SAMPLE_STATS                                                            -
[-        ] process > CALL_WF:UTILS_COHORT_STATS                                                            -
[-        ] process > MINOR_VARIANTS_ANALYSIS_WF:BCFTOOLS_MERGE__LOFREQ                                     -
[-        ] process > MINOR_VARIANTS_ANALYSIS_WF:TBPROFILER_VCF_PROFILE__LOFREQ                             -
[-        ] process > MINOR_VARIANTS_ANALYSIS_WF:TBPROFILER_COLLATE__LOFREQ                                 -
[-        ] process > MINOR_VARIANTS_ANALYSIS_WF:UTILS_MULTIPLE_INFECTION_FILTER                            -
[-        ] process > UTILS_MERGE_COHORT_STATS                                                              -
[-        ] process > STRUCTURAL_VARIANTS_ANALYSIS_WF:BWA_MEM__DELLY                                        -
[-        ] process > STRUCTURAL_VARIANTS_ANALYSIS_WF:SAMTOOLS_MERGE__DELLY                                 -
[-        ] process > STRUCTURAL_VARIANTS_ANALYSIS_WF:GATK_MARK_DUPLICATES__DELLY                           -
[-        ] process > STRUCTURAL_VARIANTS_ANALYSIS_WF:SAMTOOLS_INDEX__DELLY                                 -
[-        ] process > STRUCTURAL_VARIANTS_ANALYSIS_WF:DELLY_CALL                                            -
[-        ] process > STRUCTURAL_VARIANTS_ANALYSIS_WF:BCFTOOLS_VIEW__DELLY                                  -
[-        ] process > STRUCTURAL_VARIANTS_ANALYSIS_WF:BCFTOOLS_MERGE__DELLY                                 -
[-        ] process > STRUCTURAL_VARIANTS_ANALYSIS_WF:TBPROFILER_VCF_PROFILE__DELLY                         -
[-        ] process > STRUCTURAL_VARIANTS_ANALYSIS_WF:TBPROFILER_COLLATE__DELLY                             -
[-        ] process > MERGE_WF:PREPARE_COHORT_VCF:GATK_COMBINE_GVCFS                                        -
[-        ] process > MERGE_WF:PREPARE_COHORT_VCF:GATK_GENOTYPE_GVCFS                                       -
[-        ] process > MERGE_WF:PREPARE_COHORT_VCF:SNPEFF                                                    -
[-        ] process > MERGE_WF:PREPARE_COHORT_VCF:BGZIP                                                     -
[-        ] process > MERGE_WF:PREPARE_COHORT_VCF:GATK_INDEX_FEATURE_FILE__COHORT                           -
[-        ] process > MERGE_WF:SNP_ANALYSIS:GATK_SELECT_VARIANTS__SNP                                       -
[-        ] process > MERGE_WF:SNP_ANALYSIS:OPTIMIZE_VARIANT_RECALIBRATION:GATK_VARIANT_RECALIBRATOR__ANN7  -
[-        ] process > MERGE_WF:SNP_ANALYSIS:OPTIMIZE_VARIANT_RECALIBRATION:UTILS_ELIMINATE_ANNOTATION__ANN7 -
[-        ] process > MERGE_WF:SNP_ANALYSIS:OPTIMIZE_VARIANT_RECALIBRATION:GATK_VARIANT_RECALIBRATOR__ANN6  -
[-        ] process > MERGE_WF:SNP_ANALYSIS:OPTIMIZE_VARIANT_RECALIBRATION:UTILS_ELIMINATE_ANNOTATION__ANN6 -
[-        ] process > MERGE_WF:SNP_ANALYSIS:OPTIMIZE_VARIANT_RECALIBRATION:GATK_VARIANT_RECALIBRATOR__ANN5  -
[-        ] process > MERGE_WF:SNP_ANALYSIS:OPTIMIZE_VARIANT_RECALIBRATION:UTILS_ELIMINATE_ANNOTATION__ANN5 -
[-        ] process > MERGE_WF:SNP_ANALYSIS:OPTIMIZE_VARIANT_RECALIBRATION:GATK_VARIANT_RECALIBRATOR__ANN4  -
[-        ] process > MERGE_WF:SNP_ANALYSIS:OPTIMIZE_VARIANT_RECALIBRATION:UTILS_ELIMINATE_ANNOTATION__ANN4 -
[-        ] process > MERGE_WF:SNP_ANALYSIS:OPTIMIZE_VARIANT_RECALIBRATION:GATK_VARIANT_RECALIBRATOR__ANN3  -
[-        ] process > MERGE_WF:SNP_ANALYSIS:OPTIMIZE_VARIANT_RECALIBRATION:UTILS_ELIMINATE_ANNOTATION__ANN3 -
[-        ] process > MERGE_WF:SNP_ANALYSIS:OPTIMIZE_VARIANT_RECALIBRATION:GATK_VARIANT_RECALIBRATOR__ANN2  -
[-        ] process > MERGE_WF:SNP_ANALYSIS:OPTIMIZE_VARIANT_RECALIBRATION:UTILS_ELIMINATE_ANNOTATION__ANN2 -
[-        ] process > MERGE_WF:SNP_ANALYSIS:OPTIMIZE_VARIANT_RECALIBRATION:UTILS_SELECT_BEST_ANNOTATIONS    -
[-        ] process > MERGE_WF:SNP_ANALYSIS:GATK_APPLY_VQSR__SNP                                            -
[-        ] process > MERGE_WF:SNP_ANALYSIS:GATK_SELECT_VARIANTS__EXCLUSION__SNP                            -
[-        ] process > MERGE_WF:INDEL_ANALYSIS:GATK_SELECT_VARIANTS__INDEL                                   -
[-        ] process > MERGE_WF:GATK_MERGE_VCFS__INC                                                         -
[-        ] process > MERGE_WF:MAJOR_VARIANT_ANALYSIS:TBPROFILER_VCF_PROFILE__COHORT                        -
[-        ] process > MERGE_WF:MAJOR_VARIANT_ANALYSIS:TBPROFILER_COLLATE__COHORT                            -
[-        ] process > MERGE_WF:PHYLOGENY_ANALYSIS__EXCOMPLEX:GATK_SELECT_VARIANTS__PHYLOGENY                -
[-        ] process > MERGE_WF:PHYLOGENY_ANALYSIS__EXCOMPLEX:GATK_VARIANTS_TO_TABLE                         -
[-        ] process > MERGE_WF:PHYLOGENY_ANALYSIS__EXCOMPLEX:SNPSITES                                       -
[-        ] process > MERGE_WF:PHYLOGENY_ANALYSIS__EXCOMPLEX:SNPDISTS                                       -
[-        ] process > MERGE_WF:PHYLOGENY_ANALYSIS__EXCOMPLEX:IQTREE                                         -
[-        ] process > MERGE_WF:CLUSTER_ANALYSIS__EXCOMPLEX:CLUSTERPICKER__5SNP                              -
[-        ] process > MERGE_WF:CLUSTER_ANALYSIS__EXCOMPLEX:CLUSTERPICKER__12SNP                             -
[-        ] process > MERGE_WF:PHYLOGENY_ANALYSIS__INCCOMPLEX:GATK_SELECT_VARIANTS__PHYLOGENY               -
[-        ] process > MERGE_WF:PHYLOGENY_ANALYSIS__INCCOMPLEX:GATK_VARIANTS_TO_TABLE                        -
[-        ] process > MERGE_WF:PHYLOGENY_ANALYSIS__INCCOMPLEX:SNPSITES                                      -
[-        ] process > MERGE_WF:PHYLOGENY_ANALYSIS__INCCOMPLEX:SNPDISTS                                      -
[-        ] process > MERGE_WF:PHYLOGENY_ANALYSIS__INCCOMPLEX:IQTREE                                        -
[-        ] process > MERGE_WF:CLUSTER_ANALYSIS__INCCOMPLEX:CLUSTERPICKER__5SNP                             -
[-        ] process > MERGE_WF:CLUSTER_ANALYSIS__INCCOMPLEX:CLUSTERPICKER__12SNP                            -
[-        ] process > REPORTS_WF:MULTIQC                                                                    -
[-        ] process > REPORTS_WF:UTILS_SUMMARIZE_RESISTANCE_RESULTS                                         -
[-        ] process > REPORTS_WF:UTILS_SUMMARIZE_RESISTANCE_RESULTS_MIXED_INFECTION                         -
WARN: There's no process matching config selector: VALIDATE_FASTQS_WF:SAMPLESHEET_VALIDATION
ERROR ~ Unknown method invocation `splitJson` on UnixPath type

 -- Check '.nextflow.log' file for details

abhi18av commented 1 month ago

Then I think the problem might be with your Java setup. Could you please confirm you're using an LTS version, as mentioned here: https://github.com/TORCH-Consortium/MAGMA?tab=readme-ov-file#nextflow ?

RuanSpies21 commented 1 month ago

I can confirm I'm using Java 17, an LTS version.

I don't seem to get the same error when using the alpha pre-release of v2.0.0: nextflow run 'https://github.com/TORCH-Consortium/MAGMA' -profile docker,server -r v2.0.0-alpha -params-file params.yaml

In this case, the pipeline runs successfully through the samplesheet validation step.

abhi18av commented 1 month ago

Mmm, then the next suspect is the version of Nextflow. I think pinning a newer version should fix the problem.

Could you please test with the following command? 🙏

NXF_VER=24.04.4 nextflow run 'https://github.com/TORCH-Consortium/MAGMA' -profile docker,server,test -r hotfix/bwa_k66

If this works, then I will set the minimum Nextflow version to 24.04.x in the pipeline, and you should upgrade by typing nextflow -self-update.
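
To illustrate the kind of minimum-version check being discussed, here is a small Python sketch. It is not how Nextflow or MAGMA enforce versions. The helper names are hypothetical, and the default minimum of `24.04.0` simply stands in for the "24.04.x" mentioned above.

```python
import re


def parse_version(text):
    """Turn a version string such as '24.04.4' into a tuple of ints, here (24, 4, 4)."""
    match = re.match(r"\s*(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    return tuple(int(part) for part in match.groups())


def meets_minimum(installed, minimum="24.04.0"):
    """Return True when the installed version is at least the minimum version."""
    # Tuples compare element by element, so (24, 4, 4) >= (24, 4, 0) holds.
    return parse_version(installed) >= parse_version(minimum)
```

Comparing integer tuples rather than raw strings avoids ordering mistakes such as "9.0.0" sorting after "24.04.4".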

RuanSpies21 commented 1 month ago

Ok great! Test seems to have worked. Thanks for the help. Will give it a bash with these old sequences now - holding thumbs, will let you know how it goes.

RuanSpies21 commented 1 month ago

nextflow run 'https://github.com/TORCH-Consortium/MAGMA' -profile docker,server,test -r hotfix/bwa_k66

Just getting loads of fails for VALIDATE_FASTQS_WF:FASTQ_VALIDATOR:

[100%] 14 of 14, failed: 12, retries: 8 ✔ - from test profile. So only 1 of the 3 samples is actually processed

[100%] 830 of 830, failed: 738, retries: 492 ✔ - from my sequences. Only 46/169 samples processed

abhi18av commented 1 month ago

Good, so we're past the setup issues.

> [100%] 14 of 14, failed: 12, retries: 8 ✔ - from test profile. So only 1 of the 3 samples is actually processed

I wouldn't worry too much about the samples from the test profile, since samples downloaded from NCBI (FTP) often get corrupted in transit if the network or disk performance is not good.

> [100%] 830 of 830, failed: 738, retries: 492 ✔ - from my sequences. Only 46/169 samples processed

So it seems that these samples were likely corrupted, either while downloading or while being moved across external disks/computers.

⚠️ That is the reason why we ended up adding a separate VALIDATE_FASTQS_WF:FASTQ_VALIDATOR process.

One file which you might want to inspect is the QC_statistics/cohort/fastq_validation/magma_analysis.json file which should gather information about the files such as md5sum and size along with stats generated by seqkit etc. Perhaps that might be useful in debugging the failing samples.
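
Independently of what magma_analysis.json contains (its schema is not shown in this thread), a single gzipped FASTQ can also be checked by hand. The Python sketch below records the size and md5 and tests whether the gzip stream decompresses to a whole number of 4-line records. `check_fastq_gz` is a hypothetical helper, not part of MAGMA, and it is far less thorough than a dedicated validator.

```python
import gzip
import hashlib
import os
import zlib


def check_fastq_gz(path, chunk_size=1 << 20):
    """Collect size, md5 and a basic integrity verdict for one gzipped FASTQ file."""
    digest = hashlib.md5()
    with open(path, "rb") as raw:
        for chunk in iter(lambda: raw.read(chunk_size), b""):
            digest.update(chunk)

    line_count = 0
    decompresses = True
    try:
        with gzip.open(path, "rb") as handle:
            for _ in handle:
                line_count += 1
    except (OSError, EOFError, zlib.error):
        # Truncated or corrupted gzip streams fail partway through decompression.
        decompresses = False

    return {
        "path": path,
        "size_bytes": os.path.getsize(path),
        "md5": digest.hexdigest(),
        "decompresses": decompresses,
        # A complete FASTQ has a non-zero line count that is a multiple of four.
        "whole_records": decompresses and line_count > 0 and line_count % 4 == 0,
    }
```

Comparing the reported md5 against the checksum published by the archive is a quick way to tell a bad download from a file that was already problematic at the source.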

abhi18av commented 1 month ago

I'd recommend you download your samples from NCBI/ENA using the nf-core/fetchngs pipeline (https://nf-co.re/fetchngs/1.12.0/docs/usage/), which makes sure the samples are not corrupted.

RuanSpies21 commented 1 month ago

Thanks for this @abhi18av. It's a long journey we have been on together now 😂. It seems the pipeline really does not like these old files.

I re-downloaded some of them with nf-core/fetchngs, but a large number of failures persist at VALIDATE_FASTQS_WF:FASTQ_VALIDATOR.

Further, those that do pass have 0 coverage. magma_analysis.json and joint.merged_cohort_stats.tsv are attached for interest.

As a sanity check, a batch of newer fastqs processed successfully, so the setup is fine.

[magma_analysis.json](https://github.com/user-attachments/files/17054048/magma_analysis.json) joint.merged_cohort_stats.txt

abhi18av commented 1 month ago

Hi @RuanSpies21

> It seems the pipeline really does not like these old files.

Actually, I would need more evidence to believe that, since we've been using MAGMA to analyse all Brazilian and South African sequences from SRA produced in the last 20 years, and unless there's something wrong with the samples themselves, they get through.

That is the reason why we added the JSON file, so that we can have a better overview of the samples which failed. Could you please share that QC_statistics/cohort/fastq_validation/magma_analysis.json file with me?

> Further, those that do pass have 0 coverage.

Indeed, the results here are very suspicious. I will try to run these samples on my end to see if they are at least reproducible.

SAMPLE AVG_INSERT_SIZE MAPPED_PERCENTAGE RAW_TOTAL_SEQS AVERAGE_BASE_QUALITY MEAN_COVERAGE SD_COVERAGE MEDIAN_COVERAGE MAD_COVERAGE PCT_EXC_ADAPTER PCT_EXC_MAPQ PCT_EXC_DUPE PCT_EXC_UNPAIRED PCT_EXC_BASEQ PCT_EXC_OVERLAP PCT_EXC_CAPPED PCT_EXC_TOTAL PCT_1X PCT_5X PCT_10X PCT_30X PCT_50X PCT_100X LINEAGES FREQUENCIES MAPPED_NTM_FRACTION_16S MAPPED_NTM_FRACTION_16S_THRESHOLD_MET COVERAGE_THRESHOLD_MET BREADTH_OF_COVERAGE_THRESHOLD_MET RELABUNDANCE_THRESHOLD_MET ALL_THRESHOLDS_MET
MAGMA.ERX023849_ERR046787 0.0 0.0 7272484.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1 0 0 0 0
MAGMA.ERX023851_ERR046789 0.0 0.0 3365516.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1 0 0 0 0
MAGMA.ERX023852_ERR046790 0.0 0.0 7425878.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1 0 0 0 0
MAGMA.ERX023853_ERR046791 0.0 0.0 6207574.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1 0 0 0 0
MAGMA.ERX023885_ERR046823 0.0 0.0 6324052.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1 0 0 0 0
MAGMA.ERX023913_ERR046851 0.0 0.0 6399012.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1 0 0 0 0
MAGMA.ERX023975_ERR046913 0.0 0.0 6674844.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1 0 0 0 0
MAGMA.ERX024002_ERR046940 0.0 0.0 6617920.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1 0 0 0 0
MAGMA.ERX024012_ERR046950 0.0 0.0 6311164.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1 0 0 0 0
MAGMA.ERX049831_ERR072065 0.0 0.0 3248118.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1 0 0 0 0
MAGMA.ERX049843_ERR072077 0.0 0.0 3311862.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1 0 0 0 0
MAGMA.ERX049846_ERR072080 0.0 0.0 2881802.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1 0 0 0 0
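
Rows like the ones above (reads present, nothing mapped) can be picked out of a larger cohort file with a few lines of Python. This sketch assumes the merged cohort statistics file is tab-separated with the header shown above; `samples_with_reads_but_no_mapping` is a hypothetical helper, not part of MAGMA.

```python
import csv


def samples_with_reads_but_no_mapping(tsv_handle):
    """Return sample names that have raw reads but a mapped percentage of zero."""
    suspicious = []
    for row in csv.DictReader(tsv_handle, delimiter="\t"):
        try:
            raw_reads = float(row["RAW_TOTAL_SEQS"])
            mapped_pct = float(row["MAPPED_PERCENTAGE"])
        except (KeyError, TypeError, ValueError):
            continue  # skip rows where either field is absent or not numeric
        if raw_reads > 0 and mapped_pct == 0.0:
            suspicious.append(row.get("SAMPLE", ""))
    return suspicious
```

It can be called on an open file, for example `samples_with_reads_but_no_mapping(open("joint.merged_cohort_stats.tsv", newline=""))`, with the path adjusted to wherever the results were written.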

RuanSpies21 commented 1 month ago

Ah ok I see.

Here is the QC_statistics/cohort/fastq_validation/magma_analysis.json file magma_analysis.json

abhi18av commented 1 month ago

Hi @RuanSpies21, just letting you know that I'm still tracking this; I'm just running into some resource constraints these days on our shared server.

RuanSpies21 commented 1 month ago

No worries @abhi18av! Thank you so much - you have already been so accommodating.