emo-bon / MetaGOflow

MGnify oriented implementation for the Marine Genomic Observatories oriented pipeline, developed in the framework of an EOSC-Life funded project
https://metagoflow.readthedocs.io
Apache License 2.0

add descriptions of vars provided by the user #14

Closed hariszaf closed 1 year ago

hariszaf commented 1 year ago

This PR aims to provide descriptions of the arguments provided by the user through the config.yml file.

jprmachado commented 1 year ago

@hariszaf

The changes don't break the workflow. It produces nearly identical output, though I did notice some small differences.

These files are absent in the output:

sequence-categorisation:
- alpha_tmrna.rf01849.fasta.gz
- metazoa_srp.rf00017.fasta.gz
- tmrna.rf00023.fasta.gz
diff of the parameters yml files:
7d6
< count_faa_from_previous_run: null
9a9
> diamond_maxTargetSeqs: 1

diamond_maxTargetSeqs was changed, so those differences are expected. Do you think it's normal for these files to be absent?
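A comparison like the one above can be scripted rather than done by eye. The following is only a minimal sketch, assuming two output directories and two parameter files from the runs being compared. The function names and any paths passed to them are hypothetical and not part of MetaGOflow.

```python
from pathlib import Path
import difflib


def relative_files(root):
    """Return the set of file paths under root, relative to root."""
    root = Path(root)
    return {p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()}


def missing_files(old_run, new_run):
    """List files present in old_run but absent from new_run, sorted."""
    return sorted(relative_files(old_run) - relative_files(new_run))


def param_diff(old_yml, new_yml):
    """Return the added ('+') and removed ('-') lines between two parameter files.

    This is a plain text comparison; it does not parse the YAML.
    """
    old_lines = Path(old_yml).read_text().splitlines()
    new_lines = Path(new_yml).read_text().splitlines()
    diff = list(difflib.unified_diff(old_lines, new_lines, lineterm="", n=0))
    # The first two entries are the '---' / '+++' file headers; skip them
    # and drop the '@@' hunk markers.
    return [line for line in diff[2:] if line.startswith(("+", "-"))]
```

Calling `missing_files` on the results directories of the previous and the current run would list files such as those under sequence-categorisation, and `param_diff` on the two parameter yml files would report the changed keys in a form similar to the diff above.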

jprmachado commented 1 year ago

In addition to better documentation, it also fixes the examples of pseudo paths for partial runs. I think this PR is a good improvement.

hariszaf commented 1 year ago

Great, I will remove the unused variable, and once I commit and you run a test again we can merge.

Question: how long does it take to run the pipeline with the small data you added ?!

hariszaf commented 1 year ago

I will check the output diffs you mentioned asap.

jprmachado commented 1 year ago

Question: how long does it take to run the pipeline with the small data you added ?!

This one took around 4 hours.

jprmachado commented 1 year ago

I will check the output diffs you mentioned asap.

I also left a few comments on the commit. I mention it only in case you have not noticed them; they are just suggestions.

hariszaf commented 1 year ago

@jprmachado could you run your test case and merge if everything's fine? Thanks

jprmachado commented 1 year ago

Results of test:

.
├── fastp.html
├── final.contigs.fa
├── functional-annotation
│   ├── stats
│   │   ├── go.stats
│   │   ├── interproscan.stats
│   │   ├── ko.stats
│   │   ├── orf.stats
│   │   └── pfam.stats
│   ├── test.merged_CDS.I5.tsv.chunks
│   ├── test.merged_CDS.I5.tsv.gz
│   ├── test.merged.hmm.tsv.chunks
│   ├── test.merged.hmm.tsv.gz
│   ├── test.merged.summary.go
│   ├── test.merged.summary.go_slim
│   ├── test.merged.summary.ips
│   ├── test.merged.summary.ko
│   └── test.merged.summary.pfam
├── merged_qc
│   ├── GC-distribution.out.full
│   ├── GC-distribution.out.full_bin
│   ├── GC-distribution.out.full_pcbin
│   ├── nucleotide-distribution.out.full
│   ├── seq-length.out.full
│   ├── seq-length.out.full_bin
│   ├── seq-length.out.full_pcbin
│   └── summary.out
├── qc_summary
├── RNA-counts
├── sequence-categorisation
│   ├── 5_8S.fa.gz
│   ├── LSU.fasta.chunks
│   ├── LSU.fasta.gz
│   ├── LSU_rRNA_archaea.RF02540.fa.gz
│   ├── LSU_rRNA_bacteria.RF02541.fa.gz
│   ├── LSU_rRNA_eukarya.RF02543.fa.gz
│   ├── SSU.fasta.chunks
│   ├── SSU.fasta.gz
│   ├── SSU_rRNA_bacteria.RF00177.fa.gz
│   ├── SSU_rRNA_eukarya.RF01960.fa.gz
│   └── tRNA.RF00005.fasta.gz
├── taxonomy-summary
│   ├── LSU
│   │   ├── krona.html
│   │   ├── test.merged_LSU.fasta.mseq.gz
│   │   ├── test.merged_LSU.fasta.mseq_hdf5.biom
│   │   ├── test.merged_LSU.fasta.mseq_json.biom
│   │   ├── test.merged_LSU.fasta.mseq.tsv
│   │   └── test.merged_LSU.fasta.mseq.txt
│   └── SSU
│       ├── krona.html
│       ├── test.merged_SSU.fasta.mseq.gz
│       ├── test.merged_SSU.fasta.mseq_hdf5.biom
│       ├── test.merged_SSU.fasta.mseq_json.biom
│       ├── test.merged_SSU.fasta.mseq.tsv
│       └── test.merged_SSU.fasta.mseq.txt
├── test_1_fwd_HWLTKDRXY_600000
│   ├── GC-distribution.out.full
│   ├── GC-distribution.out.full_bin
│   ├── GC-distribution.out.full_pcbin
│   ├── nucleotide-distribution.out.full
│   ├── seq-length.out.full
│   ├── seq-length.out.full_bin
│   ├── seq-length.out.full_pcbin
│   └── summary.out
├── test_1_fwd_HWLTKDRXY_600000.fastq.gz.sha1
├── test_1_fwd_HWLTKDRXY_600000.fastq.trimmed.fasta
├── test_2_rev_HWLTKDRXY_600000
│   ├── GC-distribution.out.full
│   ├── GC-distribution.out.full_bin
│   ├── GC-distribution.out.full_pcbin
│   ├── nucleotide-distribution.out.full
│   ├── seq-length.out.full
│   ├── seq-length.out.full_bin
│   ├── seq-length.out.full_pcbin
│   └── summary.out
├── test_2_rev_HWLTKDRXY_600000.fastq.gz.sha1
├── test_2_rev_HWLTKDRXY_600000.fastq.trimmed.fasta
├── test.merged_CDS.faa
├── test.merged_CDS.ffn
├── test.merged.cmsearch.all.tblout.deoverlapped
├── test.merged.fasta
├── test.merged.motus.tsv
└── test.merged.unfiltered_fasta

9 directories, 75 files

Any change in the output related to sequence-categorisation might be associated with the parameter diamond_maxTargetSeqs: 1

IMO we could merge this PR, and that will also close #15, since that change is already done here.

hariszaf commented 1 year ago

Looks like everything's up-to-date.. weird :grinning: