EBI-Metagenomics / genomes-catalogue-pipeline

MGnify genome analysis pipeline

Post processing #10

Closed: KateSakharova closed this 2 years ago

KateSakharova commented 3 years ago

Fixes:

Expected structure:

MGYG...NUM
         --- genome
              --- fa
              --- fa.fai
              --- faa (main rep)
              --- gff (main rep)
              --- annotated gff (main rep)
              --- kegg, cog, cazy, ...
              --- IPS
              --- eggNOG
         --- pan-genome
              --- core_genes.txt
              --- <cluster>_mashtree.nwk
              --- pan_genome_reference.fa
              --- gene_presence_absence.Rtab
MGYG...NUM
         --- genome
              --- fa
              --- fa.fai
              --- gff
              --- faa
              --- annotated gff (main rep)
              --- kegg, cog, cazy, ...
              --- IPS
              --- eggNOG

intermediate_files/
         --- clusters_split.txt
         --- drep-filt-list.txt
         --- extra_weight_table.txt
         --- gunc_report_completed.txt
         --- names.tsv
         --- renamed_download.csv
         --- Sdb.csv
         --- mmseq.tsv
gtdb-tk_output/ (commented out for now)
rRNA_fastas/
rRNA_outs/
GFFs/
         --- gffs
         --- annotated gffs
mmseqs_output/
         --- mmseqs_0.5_outdir.tar.gz
         --- mmseqs_0.95_outdir.tar.gz
         --- mmseqs_0.9_outdir.tar.gz
         --- mmseqs_1.0_outdir.tar.gz
panaroo_output/
         --- MGYG.._panaroo.tar.gz
         --- ...
dreplicated_genomes/ (for GTDB-Tk)
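
A minimal sketch of how one MGYG... folder could be checked against the structure above. It is not part of the pipeline: the function name `missing_files` is hypothetical, and it assumes files are named `<accession>.<extension>` and that the mashtree file is prefixed with the folder name. Only the plainly listed files (fa, fa.fai, faa, gff and the pan-genome files) are checked.

```python
from pathlib import Path

# Extensions listed under "genome" in the expected structure.
# Assumption: files are named <accession>.<extension>.
GENOME_EXTENSIONS = ("fa", "fa.fai", "faa", "gff")

# Fixed file names listed under "pan-genome".
PAN_GENOME_FILES = (
    "core_genes.txt",
    "pan_genome_reference.fa",
    "gene_presence_absence.Rtab",
)


def missing_files(cluster_dir: Path) -> list[str]:
    """Return expected-but-absent paths, relative to one MGYG... folder."""
    accession = cluster_dir.name
    expected = [Path("genome") / f"{accession}.{ext}" for ext in GENOME_EXTENSIONS]
    # Pan-genome files are only checked when the pan-genome folder exists.
    if (cluster_dir / "pan-genome").is_dir():
        expected += [Path("pan-genome") / name for name in PAN_GENOME_FILES]
        # Assumption: <cluster> in the mashtree file name equals the folder name.
        expected.append(Path("pan-genome") / f"{accession}_mashtree.nwk")
    return [p.as_posix() for p in expected if not (cluster_dir / p).is_file()]
```

An empty list means every checked file is present; otherwise the list names what is missing, for example `genome/MGYG000000002.gff`.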

KateSakharova commented 2 years ago

Pipeline tested up to commit 6203a29. Output looks like:

test-post-pros-fixes
├── deperlicated_genomes
│   ├── MGYG000000002.fa
│   └── MGYG000000003.fa
├── GFF
│   ├── annotated_MGYG000000002.gff.gz
│   ├── annotated_MGYG000000003.gff.gz
│   ├── MGYG000000001.gff.gz
│   ├── MGYG000000002.gff.gz
│   └── MGYG000000003.gff.gz
├── intermediate_files
│   ├── clusters_split.txt
│   ├── drep-filt-list.txt
│   ├── extra_weight_table.txt
│   ├── gunc_report_completed.txt
│   ├── mmseqs_cluster.tsv
│   ├── names.tsv
│   ├── renamed_download.csv
│   └── Sdb.csv
├── MGYG000000002
│   └── genome
│       ├── annotated_MGYG000000002.gff
│       ├── MGYG000000002_annotation_coverage.tsv
│       ├── MGYG000000002_cazy_summary.tsv
│       ├── MGYG000000002_cog_summary.tsv
│       ├── MGYG000000002.fa
│       ├── MGYG000000002.faa
│       ├── MGYG000000002.fa.fai
│       ├── MGYG000000002.gff
│       ├── MGYG000000002_kegg_classes.tsv
│       └── MGYG000000002_kegg_modules.tsv
├── MGYG000000003
│   ├── genome
│   │   ├── annotated_MGYG000000003.gff
│   │   ├── MGYG000000003_annotation_coverage.tsv
│   │   ├── MGYG000000003_cazy_summary.tsv
│   │   ├── MGYG000000003_cog_summary.tsv
│   │   ├── MGYG000000003.fa
│   │   ├── MGYG000000003.faa
│   │   ├── MGYG000000003.fa.fai
│   │   ├── MGYG000000003.gff
│   │   ├── MGYG000000003_kegg_classes.tsv
│   │   └── MGYG000000003_kegg_modules.tsv
│   └── pan-genome
│       ├── core_genes.txt
│       ├── gene_presence_absence.Rtab
│       ├── MGYG000000003_mashtree.nwk
│       └── pan_genome_reference.fa
├── mmseqs_output
│   ├── mmseqs_0.5_outdir.tar.gz
│   ├── mmseqs_0.95_outdir.tar.gz
│   ├── mmseqs_0.9_outdir.tar.gz
│   └── mmseqs_1.0_outdir.tar.gz
├── panaroo_output
│   └── MGYG000000003_panaroo.tar.gz
├── rRNA_fastas
│   ├── MGYG000000001_fasta-results
│   │   └── MGYG000000001_rRNAs.fasta
│   ├── MGYG000000002_fasta-results
│   │   └── MGYG000000002_rRNAs.fasta
│   └── MGYG000000003_fasta-results
│       └── MGYG000000003_rRNAs.fasta
└── rRNA_outs
    ├── MGYG000000001_out-results
    │   ├── MGYG000000001_rRNAs.out
    │   └── MGYG000000001_tRNA_20aa.out
    ├── MGYG000000002_out-results
    │   ├── MGYG000000002_rRNAs.out
    │   └── MGYG000000002_tRNA_20aa.out
    └── MGYG000000003_out-results
        ├── MGYG000000003_rRNAs.out
        └── MGYG000000003_tRNA_20aa.out
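
In the listing above, only some MGYG... folders carry a pan-genome subfolder. A minimal sketch of how that could be summarised for a results directory follows; the function name `pan_genome_status` is hypothetical, and it assumes genome folders sit at the top level, start with `MGYG`, and contain a `genome` subfolder.

```python
from pathlib import Path


def pan_genome_status(results_dir: Path) -> dict[str, bool]:
    """Map each top-level MGYG... folder to whether it has a pan-genome subfolder."""
    return {
        d.name: (d / "pan-genome").is_dir()
        for d in sorted(results_dir.iterdir())
        # Skip folders such as GFF or rRNA_outs, which have no genome subfolder.
        if d.is_dir() and d.name.startswith("MGYG") and (d / "genome").is_dir()
    }
```

For the tree shown above, `pan_genome_status(Path("test-post-pros-fixes"))` would return `{"MGYG000000002": False, "MGYG000000003": True}`.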