BreakerLab / dimpl

DIMPL: Discovery of Intergenic Motifs PipeLine
MIT License

Incorporate environmental databases into search #4

Open snmalk opened 4 years ago

snmalk commented 4 years ago

Potential environmental databases we could pull from. (Linked API page where I could)

snmalk commented 4 years ago

@kenibrewer, what steps did you and Glenn take to incorporate RefSeq from NCBI? Did you download the database or access it over an API? What formatting or other steps would we need to make a new database compatible with dimpl?

kenibrewer commented 4 years ago

@ggaffield is actually the expert here on what files are needed. He created the script src/shell/gff2bed.py, whose convert function turns an NCBI-style GFF file into the .bed files that we use for generating the Genome Context images.
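For anyone unfamiliar with the two formats, here is a minimal sketch of what a GFF-to-BED conversion involves. It is not the code in src/shell/gff2bed.py; the function name `gff_line_to_bed` and the choice of BED6 with the `Name`/`ID` attribute as the feature name are assumptions made for illustration. The key detail is the coordinate shift: GFF is 1-based and inclusive, BED is 0-based and half-open.

```python
def gff_line_to_bed(line):
    """Convert one GFF3 feature line to a BED6 line (hypothetical helper).

    Returns None for comments, blank lines, and malformed rows.
    """
    if not line.strip() or line.startswith("#"):
        return None
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        return None
    seqid, _source, ftype, start, end, _score, strand, _phase, attrs = fields

    # GFF coordinates are 1-based inclusive; BED is 0-based half-open,
    # so only the start needs to be shifted.
    bed_start = int(start) - 1
    bed_end = int(end)

    # GFF3 attributes are semicolon-separated key=value pairs.
    attr = dict(kv.split("=", 1) for kv in attrs.split(";") if "=" in kv)
    # Assumption: prefer Name, then ID, then the feature type as the label.
    name = attr.get("Name") or attr.get("ID") or ftype

    return "\t".join([seqid, str(bed_start), str(bed_end), name, "0", strand])


if __name__ == "__main__":
    example = "NC_000913.3\tRefSeq\tgene\t190\t255\t.\t+\t.\tID=gene-b0001;Name=thrL"
    print(gff_line_to_bed(example))
```

Any environmental database that ships GFF-style annotations could in principle go through the same kind of conversion; databases without annotations would need features predicted first.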

In terms of the steps he took, he wrote a separate script that downloaded all of the GFF files for a RefSeq release from NCBI, separated out all of the IGRs (plus 50 nucleotides on either side), and concatenated them into a huge fasta file called s50.igr.fasta. His script also searches the complete database of Rfam motifs against the genomes to find any that NCBI hasn't annotated yet. The Rfam hits for each genome are stored on Farnam as /ysm-gpfs/pi/breaker/data/dimpl/refseq98/features/*_rfam.bed.

Currently dimpl generates Genome Context images on the fly by requesting the relevant data from NCBI and then generating the necessary bed files (without the extra Rfam data). The development version of dimpl that @mp2452 and I are working on will speed up this process significantly by requesting these preprocessed bed files from Farnam. This also gives us a process for incorporating data that comes from sources other than NCBI.
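To make the IGR step concrete, here is a rough sketch of how intergenic regions with 50-nucleotide flanks could be pulled from one annotated sequence. This is not Glenn's script; the function names, the FASTA header format, and the treatment of each sequence as linear (no wrap-around for circular genomes) are all assumptions made for illustration. Gene intervals are taken as 0-based, half-open coordinates, as in a bed file.

```python
def intergenic_regions(genes, seq_len, flank=50):
    """Return (start, end) intervals for the gaps between genes.

    genes   -- iterable of (start, end) pairs, 0-based half-open
    seq_len -- length of the sequence the genes lie on
    flank   -- nucleotides of flanking gene sequence kept on each side

    Overlapping genes are merged implicitly by tracking the furthest
    gene end seen so far. The sequence is treated as linear.
    """
    igrs = []
    prev_end = 0
    for start, end in sorted(genes):
        if start > prev_end:
            # Gap found: extend into the neighbouring genes, clipped to the sequence.
            igrs.append((max(0, prev_end - flank), min(seq_len, start + flank)))
        prev_end = max(prev_end, end)
    if prev_end < seq_len:
        # Region after the last gene.
        igrs.append((max(0, prev_end - flank), seq_len))
    return igrs


def igr_fasta_records(seq_id, sequence, genes, flank=50):
    """Yield FASTA records, one per IGR, with 1-based coordinates in the header."""
    for start, end in intergenic_regions(genes, len(sequence), flank):
        yield f">{seq_id}:{start + 1}-{end}\n{sequence[start:end]}\n"


if __name__ == "__main__":
    toy_seq = "A" * 10 + "C" * 10 + "G" * 10
    for record in igr_fasta_records("toy", toy_seq, [(10, 20)], flank=2):
        print(record, end="")
```

Concatenating the records from every genome would give a single searchable file in the spirit of s50.igr.fasta. The same approach should carry over to metagenomic contigs, provided gene coordinates are available for them.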

Personally, I think MGnify is probably the best source for a set of metagenomic sequences. They maintain a broadly non-redundant set of protein sequences as a downloadable search database. If we can figure out how to extract the IGRs from the same selection of contigs they use to build their non-redundant protein data set, I think we would be in a really good place.

ggaffield commented 4 years ago

Ken's got it right.

As a shortcut for searching, I can easily tack the env12 sequences onto refseq98. However, they would not have any identified features when displaying context. Adding those would take quite a bit of work: either extracting the features from the Bliss database somehow or just re-BLASTing the sequences, plus running an Rfam search on them to pick up known motifs.