CFIA-NCFAD / wgscovplot

The Whole Genome Sequencing Coverage Plot (wgscovplot) is a tool to generate HTML Interactive Coverage Plot given coverage depth information, variants and DNA Gene features

Automatic GFF and reference FASTA retrieval #26

Closed: peterk87 closed this issue 8 months ago

peterk87 commented 2 years ago

The user shouldn't have to provide a reference FASTA/GenBank file or any other files if the necessary files are already present in the --inputdir.

For example, if SnpEff is being run, a SnpEff DB should be constructed with the following files:

.
├── snpeff.config
└── snpeff_db
    ├── genomes
    │   └── MN908947.3.fa
    └── MN908947.3
        ├── genes.gff
        └── snpEffectPredictor.bin
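
A minimal sketch of how the files in a layout like the one above could be located. The function name `find_snpeff_reference` is hypothetical and not taken from wgscovplot; it assumes the `snpeff_db` folder sits directly under the input directory and that the FASTA is named `<genome>.fa`, as in the tree shown.

```python
from pathlib import Path
from typing import Optional, Tuple


def find_snpeff_reference(input_dir: Path) -> Optional[Tuple[str, Path, Path]]:
    """Return (genome name, FASTA path, GFF path) from a snpeff_db folder, if present.

    Hypothetical helper: assumes the layout shown in the tree above.
    """
    db_dir = input_dir / "snpeff_db"
    if not db_dir.is_dir():
        return None
    # Each genome has its own subdirectory holding genes.gff.
    for gff in sorted(db_dir.glob("*/genes.gff")):
        name = gff.parent.name
        # The matching sequence is expected at genomes/<name>.fa.
        fasta = db_dir / "genomes" / f"{name}.fa"
        if fasta.is_file():
            return name, fasta, gff
    return None
```

If the function returns `None`, the caller would fall back to another way of determining the reference.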

Alternatively, other files in the inputdir may contain information on what the reference is:

Medaka VCF file:

##fileformat=VCFv4.1
##FILTER=<ID=PASS,Description="All filters passed">
##medaka_version=1.4.4
##contig=<ID=MN908947.3>
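
As an illustration, the contig IDs in a VCF header like this one could be pulled out with a small regular expression. This is a sketch only; `contig_ids_from_vcf_header` is a hypothetical name, and it assumes an uncompressed VCF whose `##contig` lines start with the `ID` key.

```python
import re
from typing import Iterable, List

# Matches header lines such as "##contig=<ID=MN908947.3>" or
# "##contig=<ID=MN908947.3,length=29903>".
CONTIG_ID = re.compile(r"^##contig=<ID=([^,>]+)")


def contig_ids_from_vcf_header(lines: Iterable[str]) -> List[str]:
    """Collect contig IDs from the header lines of a VCF file."""
    ids: List[str] = []
    for line in lines:
        if not line.startswith("#"):
            break  # the header is over once the first record appears
        match = CONTIG_ID.match(line)
        if match:
            ids.append(match.group(1))
    return ids
```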

Samtools idxstats file:

MN908947.3      29903   63      0
*       0       0       19373
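
The idxstats columns are reference name, sequence length, mapped reads and unmapped reads, so the reference could be inferred along these lines. The helper name `reference_from_idxstats` is hypothetical, and picking the sequence with the most mapped reads is an assumption made for this sketch, not something stated in the issue.

```python
from typing import Optional, Tuple


def reference_from_idxstats(text: str) -> Optional[Tuple[str, int]]:
    """Return (name, length) of the reference with the most mapped reads."""
    best: Optional[Tuple[str, int, int]] = None
    for line in text.splitlines():
        fields = line.split()
        # Skip malformed lines and the "*" row, which counts unplaced reads.
        if len(fields) < 4 or fields[0] == "*":
            continue
        name, length, mapped = fields[0], int(fields[1]), int(fields[2])
        if best is None or mapped > best[2]:
            best = (name, length, mapped)
    return None if best is None else (best[0], best[1])
```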

Mosdepth BED file:

MN908947.3      0       895     0
MN908947.3      895     941     1
MN908947.3      941     944     0
MN908947.3      944     974     1
MN908947.3      974     975     0
MN908947.3      975     1006    1
MN908947.3      1006    1041    2
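
For the BED output, the first column already names the reference, so a sketch like the following would be enough to list the distinct sequence names. `reference_names_from_bed` is a hypothetical helper; it assumes the BED text has already been decompressed and read as lines.

```python
from typing import Iterable, List


def reference_names_from_bed(lines: Iterable[str]) -> List[str]:
    """Return the distinct sequence names in column 1, in order of appearance."""
    names: List[str] = []
    seen = set()
    for line in lines:
        fields = line.split()
        # A BED record needs at least name, start and end.
        if len(fields) < 3 or line.startswith("#"):
            continue
        if fields[0] not in seen:
            seen.add(fields[0])
            names.append(fields[0])
    return names
```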

So it should be possible to infer that the reference accession is MN908947.3, look it up on NCBI with the Entrez API and retrieve it.
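
A rough sketch of the retrieval step, using the NCBI E-utilities `efetch` endpoint with only the Python standard library. The function names are hypothetical and this is not necessarily how wgscovplot does it; a real implementation would likely also add error handling and respect NCBI's rate limits.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accession: str, rettype: str = "gb") -> str:
    """Build an efetch URL for a nucleotide accession (GenBank flat file by default)."""
    params = {"db": "nuccore", "id": accession, "rettype": rettype, "retmode": "text"}
    return f"{EFETCH_URL}?{urlencode(params)}"


def fetch_genbank(accession: str, timeout: float = 30.0) -> str:
    """Download the GenBank record for an accession as text (needs network access)."""
    with urlopen(efetch_url(accession), timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Calling `fetch_genbank("MN908947.3")` would return the record as a string, which could then be written to disk or parsed.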

nhhaidee commented 2 years ago

The tool is now able to retrieve the reference FASTA file and GenBank file from NCBI automatically.

Retrieving these files from the snpeff_db folder and handling GFF files are in progress.

peterk87 commented 8 months ago

This should be implemented as of 0.2.0, with only one GET request per reference sequence as of 1.0.0 (the GenBank file is fetched and parsed into FASTA and GFF).