emo-bon / MetaGOflow

MGnify oriented implementation for the Marine Genomic Observatories oriented pipeline, developed in the framework of an EOSC-Life funded project
https://metagoflow.readthedocs.io
Apache License 2.0

adds new dataset for quick runs #12

Closed jprmachado closed 1 year ago

jprmachado commented 1 year ago

This PR adds a new dataset, subsampled from an original full dataset using seqtk:

seqtk sample -s100 DBB_AAABOSDA_1_2_HWLTKDRXY.UDI208_clean.fastq 600000 > HWLTKDRXY_600000.fastq
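As a quick sanity check on the subsampling step above, the number of reads in the resulting file can be counted. This is only a minimal sketch: the helper name `count_fastq_reads` is hypothetical, and it assumes an uncompressed FASTQ file with exactly four lines per record (no wrapped sequence lines).

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of MetaGOflow or seqtk):
# count the reads in an uncompressed FASTQ file.
# Assumption: every record spans exactly 4 lines.
count_fastq_reads() {
    local fastq="$1"
    # NR is the total number of lines awk has read; 4 lines make one read.
    awk 'END { print NR / 4 }' "$fastq"
}

# Usage with the file name from the seqtk command above:
#   count_fastq_reads HWLTKDRXY_600000.fastq
```

If the source file holds at least 600000 reads, the count for HWLTKDRXY_600000.fastq should match the number requested from seqtk.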

Running with parallel, the workflow produces the results in around 5 hours. Even without parallel I don't expect it to take much more than 5 hours (but I haven't tested that).

This can be used for further development and refinement of the workflow.

Output

.
├── fastp.html
├── final.contigs.fa
├── functional-annotation
│   ├── stats
│   │   ├── go.stats
│   │   ├── interproscan.stats
│   │   ├── ko.stats
│   │   ├── orf.stats
│   │   └── pfam.stats
│   ├── subset.merged_CDS.I5.tsv.chunks
│   ├── subset.merged_CDS.I5.tsv.gz
│   ├── subset.merged.hmm.tsv.chunks
│   ├── subset.merged.hmm.tsv.gz
│   ├── subset.merged.summary.go
│   ├── subset.merged.summary.go_slim
│   ├── subset.merged.summary.ips
│   ├── subset.merged.summary.ko
│   └── subset.merged.summary.pfam
├── merged_qc
│   ├── GC-distribution.out.full
│   ├── GC-distribution.out.full_bin
│   ├── GC-distribution.out.full_pcbin
│   ├── nucleotide-distribution.out.full
│   ├── seq-length.out.full
│   ├── seq-length.out.full_bin
│   ├── seq-length.out.full_pcbin
│   └── summary.out
├── qc_summary
├── RNA-counts
├── sequence-categorisation
│   ├── 5_8S.fa.gz
│   ├── alpha_tmRNA.RF01849.fasta.gz
│   ├── LSU.fasta.chunks
│   ├── LSU.fasta.gz
│   ├── LSU_rRNA_archaea.RF02540.fa.gz
│   ├── LSU_rRNA_bacteria.RF02541.fa.gz
│   ├── LSU_rRNA_eukarya.RF02543.fa.gz
│   ├── Metazoa_SRP.RF00017.fasta.gz
│   ├── SSU.fasta.chunks
│   ├── SSU.fasta.gz
│   ├── SSU_rRNA_bacteria.RF00177.fa.gz
│   ├── SSU_rRNA_eukarya.RF01960.fa.gz
│   ├── tmRNA.RF00023.fasta.gz
│   └── tRNA.RF00005.fasta.gz
├── subset_1_HWLTKDRXY_600000
│   ├── GC-distribution.out.full
│   ├── GC-distribution.out.full_bin
│   ├── GC-distribution.out.full_pcbin
│   ├── nucleotide-distribution.out.full
│   ├── seq-length.out.full
│   ├── seq-length.out.full_bin
│   ├── seq-length.out.full_pcbin
│   └── summary.out
├── subset_1_HWLTKDRXY_600000.fastq.gz.sha1
├── subset_1_HWLTKDRXY_600000.fastq.trimmed.fasta
├── subset_2_HWLTKDRXY_600000
│   ├── GC-distribution.out.full
│   ├── GC-distribution.out.full_bin
│   ├── GC-distribution.out.full_pcbin
│   ├── nucleotide-distribution.out.full
│   ├── seq-length.out.full
│   ├── seq-length.out.full_bin
│   ├── seq-length.out.full_pcbin
│   └── summary.out
├── subset_2_HWLTKDRXY_600000.fastq.gz.sha1
├── subset_2_HWLTKDRXY_600000.fastq.trimmed.fasta
├── subset.merged_CDS.faa
├── subset.merged_CDS.ffn
├── subset.merged.cmsearch.all.tblout.deoverlapped
├── subset.merged.fasta
├── subset.merged.motus.tsv
├── subset.merged.unfiltered_fasta
└── taxonomy-summary
    ├── LSU
    │   ├── krona.html
    │   ├── subset.merged_LSU.fasta.mseq.gz
    │   ├── subset.merged_LSU.fasta.mseq_hdf5.biom
    │   ├── subset.merged_LSU.fasta.mseq_json.biom
    │   ├── subset.merged_LSU.fasta.mseq.tsv
    │   └── subset.merged_LSU.fasta.mseq.txt
    └── SSU
        ├── krona.html
        ├── subset.merged_SSU.fasta.mseq.gz
        ├── subset.merged_SSU.fasta.mseq_hdf5.biom
        ├── subset.merged_SSU.fasta.mseq_json.biom
        ├── subset.merged_SSU.fasta.mseq.tsv
        └── subset.merged_SSU.fasta.mseq.txt

9 directories, 78 files
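For anyone re-running the workflow on this dataset, the summary line above can be reproduced without `tree` to check that a run produced the same number of outputs. The sketch below is an assumption-laden convenience, not part of the workflow: `count_outputs` is a hypothetical name, and it relies on a `find` that supports `-mindepth` (GNU and BSD find do). Unlike `tree`'s defaults, it also counts hidden files and does not count symlinks as files.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of MetaGOflow):
# print a "N directories, M files" summary for a results directory.
count_outputs() {
    local root="${1:-.}"
    local dirs files
    # -mindepth 1 excludes the root directory itself, as tree does.
    dirs=$(find "$root" -mindepth 1 -type d | wc -l)
    files=$(find "$root" -type f | wc -l)
    # Arithmetic expansion strips any padding that wc adds on some systems.
    echo "$((dirs)) directories, $((files)) files"
}

# Usage, from inside the results directory:
#   count_outputs .
```

A mismatch with the counts reported above would suggest that a step failed or that extra files were written, though it does not say which.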

jprmachado commented 1 year ago

This PR does not affect the workflow; it just adds new test data. I will merge it directly.