averagehat / pathos


Dataset mocking #1

Open averagehat opened 5 years ago

averagehat commented 5 years ago

@pirekupcode

We need mock host databases for (1) STAR, (2) bowtie2, and (3) NT. The STAR and bowtie2 indexes require a lot of memory to build when the host is human, and NT takes up too much disk space.

The recommended way to subset NT is with blastdbcmd, which ships with BLAST (available on the cluster).
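As a rough sketch of the blastdbcmd approach, the helper below only assembles the two command lines: one to pull a list of sequences out of NT as FASTA, and one to build a small database from that FASTA. The function names, the `ids.txt` file of accessions, and the `mock_nt` output prefix are all made up for illustration; nothing here is taken from the pathos code.

```python
import subprocess
from typing import List, Tuple


def nt_subset_commands(db: str, id_file: str, out_prefix: str) -> Tuple[List[str], List[str]]:
    """Return argv lists for (1) extracting a subset of a BLAST database as
    FASTA and (2) building a new, small nucleotide database from it.

    `id_file` is a hypothetical text file with one sequence identifier per line.
    """
    fasta = out_prefix + ".fasta"
    extract = [
        "blastdbcmd",
        "-db", db,                # source database, e.g. "nt"
        "-entry_batch", id_file,  # identifiers to extract, one per line
        "-outfmt", "%f",          # write the sequences as FASTA
        "-out", fasta,
    ]
    rebuild = [
        "makeblastdb",
        "-in", fasta,
        "-dbtype", "nucl",
        "-out", out_prefix,
    ]
    return extract, rebuild


def build_mock_nt(db: str, id_file: str, out_prefix: str) -> None:
    """Run both commands; requires the BLAST tools to be on PATH."""
    for cmd in nt_subset_commands(db, id_file, out_prefix):
        subprocess.run(cmd, check=True)
```

Keeping command construction separate from execution means the arguments can be checked in a test without BLAST installed; `build_mock_nt("nt", "ids.txt", "mock_nt")` would then be run once on the cluster.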

We could use mosquito genomes for the host databases, and build them either beforehand or during install.
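If the indexes are built during install, something like the following could generate the commands. This is only a sketch: the function names and paths are hypothetical, and it assumes `bowtie2-build` and `STAR` are on PATH. The one non-obvious part is `--genomeSAindexNbases`, which the STAR manual recommends scaling down for small genomes to min(14, log2(GenomeLength)/2 - 1); the sketch rounds that down, which is the conservative direction.

```python
import math
from typing import List


def star_sa_index_nbases(genome_length: int) -> int:
    """STAR manual guidance for small genomes:
    min(14, log2(GenomeLength)/2 - 1), rounded down here."""
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    return min(14, math.floor(math.log2(genome_length) / 2 - 1))


def host_index_commands(fasta: str, genome_length: int,
                        bowtie_prefix: str, star_dir: str) -> List[List[str]]:
    """Return argv lists that build a bowtie2 index and a STAR index
    from the same host FASTA (all paths are placeholders)."""
    bowtie = ["bowtie2-build", fasta, bowtie_prefix]
    star = [
        "STAR",
        "--runMode", "genomeGenerate",
        "--genomeDir", star_dir,
        "--genomeFastaFiles", fasta,
        "--genomeSAindexNbases", str(star_sa_index_nbases(genome_length)),
    ]
    return [bowtie, star]
```

Each returned list can be passed to `subprocess.run(cmd, check=True)`; the genome length would come from summing the sequence lengths in the FASTA.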

averagehat commented 4 years ago

I went ahead and implemented this in #4 so that I could create an integration test for the py3 change.

A next step would be generating simulated reads from the database sequences, using something like the following:

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5224698/

biogrinder/grinder (on sourceforge)

dwgsim is an option
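For the integration test itself, a toy stand-in may be enough before pulling in one of the simulators above. The sketch below is not Grinder or dwgsim and models no sequencing errors; it just samples fixed-length substrings from a reference sequence with a fixed seed and writes them as FASTQ with a constant quality. All names are hypothetical.

```python
import random
from typing import List, Tuple


def simulate_reads(reference: str, n_reads: int, read_len: int,
                   seed: int = 0) -> List[Tuple[str, str]]:
    """Sample error-free reads uniformly from `reference`.

    Returns (name, sequence) pairs; the name records the 0-based start
    position so a test can check where each read should map.
    """
    if read_len > len(reference):
        raise ValueError("read_len is longer than the reference")
    rng = random.Random(seed)  # seeded so the test data is reproducible
    reads = []
    for i in range(n_reads):
        start = rng.randrange(len(reference) - read_len + 1)
        reads.append((f"read{i}_pos{start}", reference[start:start + read_len]))
    return reads


def to_fastq(reads: List[Tuple[str, str]], quality_char: str = "I") -> str:
    """Format reads as FASTQ text with a constant quality string."""
    return "".join(
        f"@{name}\n{seq}\n+\n{quality_char * len(seq)}\n" for name, seq in reads
    )
```

Because every read is an exact substring of a known database sequence, the expected hit for each read is known in advance, which is what the integration test needs.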

Spike-in existing data: