Comparing the nt and a.a. summaries is necessary for interpreting the protein search: the big-money hits will be CoVs found by the protein search but not by the nt search, and to identify these I need all the nt summaries.
Can Tantalus generate a single TSV summary file covering all datasets run so far, and post a daily update? For each line of every summary file, add the SRA accession (required) and a run identifier, e.g. zoonotic_20xxxx (optional). For the FASTA section, collapse each defline+sequence into one tab-separated line (optional), or discard the section if it is not easy to include.
This is not easy for me to do myself, for a couple of reasons. The S3 bucket is hard to access (e.g. the number of paths to summary/ directories changes unpredictably), and I have a stupid problem with very large directories on my external storage, with the net result that my local copy of the nt summaries is way behind.
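
A minimal sketch of the per-file transformation requested above, to make the desired output concrete. It is not based on the actual Tantalus code or the real summary file layout. It assumes that each summary is plain text, that the FASTA section runs from the first line starting with ">" to the end of the file, and that the accession and run identifier are known to the caller. The function names `tag_summary_lines` and `combine`, and the placeholder input lines, are hypothetical.

```python
def tag_summary_lines(lines, sra_accession, run_id=""):
    """Yield one tab-separated row per summary line, prefixed with the
    SRA accession and (optional) run identifier.

    Assumption: everything from the first '>' line onward is FASTA.
    Each FASTA record is collapsed to 'defline<TAB>sequence'.
    """
    prefix = f"{sra_accession}\t{run_id}\t"
    defline = None   # current FASTA defline, if inside the FASTA section
    seq_parts = []   # sequence lines collected for the current record

    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if defline is not None:
                yield prefix + defline + "\t" + "".join(seq_parts)
            defline, seq_parts = line, []
        elif defline is not None:
            # Inside the FASTA section: accumulate sequence text.
            seq_parts.append(line.strip())
        elif line:
            # Ordinary summary line (blank lines are skipped).
            yield prefix + line

    # Emit the final FASTA record.
    if defline is not None:
        yield prefix + defline + "\t" + "".join(seq_parts)


def combine(summaries):
    """Concatenate tagged rows from many summaries.

    `summaries` is an iterable of (sra_accession, run_id, lines) tuples.
    """
    for sra_accession, run_id, lines in summaries:
        yield from tag_summary_lines(lines, sra_accession, run_id)


if __name__ == "__main__":
    # Placeholder content only; real summary lines will look different.
    example = ["hit_a\t10\n", "hit_b\t3\n", ">seq1 demo\n", "ACGT\n", "TTGA\n"]
    for row in combine([("SRR0000001", "zoonotic_20xxxx", example)]):
        print(row)
```

With every row carrying its accession in the first column, the combined file can be filtered or joined against the protein summaries with standard tools, without needing to know the directory layout of the bucket.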