ababaian / serratus

Ultra-deep search for novel viruses
http://serratus.io
GNU General Public License v3.0

Flat-file summaries to help find big-money hits #157

Closed rcedgar closed 4 years ago

rcedgar commented 4 years ago

Comparing the nt and a.a. summaries is necessary for interpreting the protein search: the big-money hits will be CoVs found by the protein search but not by the nt search, and to identify these I need all the nt summaries.
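To make the comparison concrete, here is a minimal sketch of the "found by protein but not by nt" filter as a set difference. It assumes each search has already been reduced to a list of SRA accessions with a hit; the function name and the example accessions are hypothetical, not anything Serratus provides.

```python
from typing import Iterable, List


def protein_only_hits(protein_accessions: Iterable[str],
                      nt_accessions: Iterable[str]) -> List[str]:
    """Return accessions with a protein hit but no nt hit, sorted for stable output."""
    return sorted(set(protein_accessions) - set(nt_accessions))


# Hypothetical accessions, for illustration only.
candidates = protein_only_hits(["SRR0000002", "SRR0000001", "SRR0000003"],
                               ["SRR0000001"])
print(candidates)
```

This only works if the nt list is complete, which is why all the nt summaries are needed: a missing nt summary would make a known CoV look like a protein-only hit.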

Can Tantalus generate a single TSV summary file for all datasets run so far, and post a daily update? For each line of every summary file, add the SRA accession (required) and a run identifier, e.g. zoonotic_20xxxx (optional). For the FASTA section, collapse defline+sequence into one tabbed line (optional), or discard it if it is not easy to include.
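A rough sketch of the flattening I have in mind, in case it helps. It makes two assumptions about the summary layout: FASTA deflines start with `>`, and the FASTA section comes last in each file. The function name `flatten_summary` and the example input are hypothetical; this is not how Tantalus works today.

```python
from typing import Iterable, Iterator


def flatten_summary(lines: Iterable[str], sra: str, run_id: str = "") -> Iterator[str]:
    """Yield tab-separated lines, each prefixed with the SRA accession and run id.

    Assumptions: FASTA deflines start with '>' and sequence lines follow
    until the next defline or the end of the file.
    """
    defline = None
    seq = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if defline is not None:
                yield "\t".join([sra, run_id, defline, "".join(seq)])
            defline, seq = line, []
        elif defline is not None:
            # Inside the FASTA section: accumulate wrapped sequence lines.
            seq.append(line.strip())
        else:
            # Ordinary summary line: pass it through with the prefix.
            yield "\t".join([sra, run_id, line])
    if defline is not None:
        yield "\t".join([sra, run_id, defline, "".join(seq)])


# Hypothetical summary content, for illustration only.
example = ["field=value\n", ">seq1\n", "ACGT\n", "TTGA\n"]
for row in flatten_summary(example, "SRR0000001", "zoonotic_20xxxx"):
    print(row)
```

Keeping the accession and run id as the first two columns means the combined file can be filtered with `grep` or `cut` without parsing the rest of each line.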

This is not easy for me to do myself for a couple of reasons. The S3 bucket is hard to access (e.g. the number of paths to summary/ directories evolves in an unpredictable way), and I have a stupid problem with very large directories in my external storage, with the net result that I'm way behind on my local copy of the nt summaries.