ababaian / serratus

Ultra-deep search for novel viruses
http://serratus.io
GNU General Public License v3.0

Filter Ion Torrent (other?) from OTU table #218

Closed rcedgar closed 3 years ago

rcedgar commented 3 years ago

SRR11840026 is an Ion Torrent dataset that produced a novel OTU, which is most likely a false positive caused by Ion Torrent sequencing errors.

The master assembly table has "?" in the "platform" field for SRR11840026, and the same is true for more than half of the assemblies. Can this be fixed to give the correct platform? Do I need to go back to the SRA runinfo files ... hope not!

asl commented 3 years ago

Ion Torrent reads might have an elevated number of indels => frameshifts...

rchikhi commented 3 years ago

I'm automatically parsing the SRA runinfo CSVs that Artem gave me, but many accessions came from the STAT list rather than from Serratus, so we don't have metadata for them.

rcedgar commented 3 years ago

@rchikhi Can you fix this by downloading the SRA runinfo files that are missing? IMHO the master table should be fixed rather than me doing a post-processing hack...

rchikhi commented 3 years ago

I can try tomorrow but I don't know an immediate way to download runinfo from a massive list of arbitrary accessions

rcedgar commented 3 years ago

You can do it in Batch Entrez (https://www.ncbi.nlm.nih.gov/sites/batchentrez) as follows: make a text file with the accessions, browse to the file, and click Retrieve. This will send you to the SRA web page with the answer to the query. Select Send To, format RunInfo. There is surely an equivalent way to do this from the command line using the NCBI command-line utilities epost and efetch, something roughly like this (exact command-line arguments left as an exercise for the reader):

cat SRAs.txt | epost -db sra | efetch -format runinfo
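For a very large accession list, one option is to split it into smaller batches before uploading or piping each batch. The sketch below only does the splitting; the helper name `batches` and the default batch size of 500 are illustrative assumptions, not limits documented in this thread.

```python
def batches(accessions, size=500):
    """Yield lists of at most `size` unique, non-empty accessions.

    `size=500` is an arbitrary illustrative default, not an NCBI limit.
    """
    if size < 1:
        raise ValueError("size must be at least 1")
    # Strip whitespace, drop blank lines, and de-duplicate while keeping order.
    cleaned = list(dict.fromkeys(a.strip() for a in accessions if a.strip()))
    for start in range(0, len(cleaned), size):
        yield cleaned[start:start + size]


if __name__ == "__main__":
    # Made-up accession names, for illustration only.
    for i, batch in enumerate(batches(["SRR_A", "SRR_B", "", "SRR_A", "SRR_C"], size=2)):
        print(i, batch)
```

Each batch could then be written to its own text file and retrieved separately.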

rchikhi commented 3 years ago

excellent, thanks!

taltman commented 3 years ago

Unfortunately you can't really use EDirect from the command line for extracting SRA data using standard unixy tools, as it is all returned as XML. You'll need to use some command-line XML parser. What I recommend is to use BioPython.

Actually, there might be some overlap between this, the queries that we might want to do for co-assembly, and my need for pulling out host information. Kill three birds with one code?

Aside from 'platform', are there any other attributes that you need @rcedgar ?

taltman commented 3 years ago

@rcedgar, also help me understand which accessions are of interest; just the ones with platform '?' ?

@rchikhi I'll generate the report, and you'll do the patching into the master-of-the-universe file. Tell me which columns I'll need to extract beyond the SRA accession to allow you to join the tables. Those birds won't know what hit 'em.

rchikhi commented 3 years ago

Ah! great, thanks Tomer. I only need the 'Run' and 'Platform' fields from the SRA metadata.
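As a rough illustration of pulling just those two fields, here is a minimal sketch that reads RunInfo CSV text and builds a Run-to-Platform lookup. It assumes the CSV has `Run` and `Platform` header columns; the function name `run_platform_map` and the sample accessions are made up, and this is not the script used in the project.

```python
import csv
import io


def run_platform_map(runinfo_csv_text):
    """Return a {Run: Platform} dict from RunInfo CSV text."""
    platforms = {}
    for row in csv.DictReader(io.StringIO(runinfo_csv_text)):
        run = (row.get("Run") or "").strip()
        # Skip rows without a run ID and repeated header rows, which can
        # appear when several RunInfo downloads are concatenated.
        if not run or run == "Run":
            continue
        # Use "?" for a missing platform, matching the master table's placeholder.
        platforms[run] = (row.get("Platform") or "").strip() or "?"
    return platforms


if __name__ == "__main__":
    # Made-up rows, for illustration only.
    text = "Run,spots,Platform\nSRR_A,10,ION_TORRENT\nSRR_B,20,ILLUMINA\n"
    print(run_platform_map(text))
```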

rcedgar commented 3 years ago

How about this:

@tomer make a tsv file from the runinfo's for all the assemblies. (IMO csv is a totally brain-damaged format because commas often appear in text while tabs rarely do, see csvformat -h).

@rchikhi remove runinfo fields from the master table, stick to what you get from the assemblies. Maybe switch to tsv format :-)

Keep two separate tsvs. This is fine; there's no need to join everything IMO, and joining can be tricky.
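To make the csv-versus-tsv point concrete, here is a small sketch that converts CSV text to TSV with Python's standard `csv` module, so that commas inside quoted fields survive. The function name `csv_to_tsv` is hypothetical, and collapsing whitespace inside fields is a design choice of this sketch, not something agreed in the thread.

```python
import csv
import io


def csv_to_tsv(csv_text):
    """Convert CSV text to TSV text, one output line per CSV record."""
    lines = []
    for row in csv.reader(io.StringIO(csv_text)):
        # Collapse every whitespace run (including tabs and newlines) inside a
        # field to a single space so the field cannot break the TSV layout.
        cleaned = [" ".join(field.split()) for field in row]
        lines.append("\t".join(cleaned))
    return "".join(line + "\n" for line in lines)


if __name__ == "__main__":
    # Made-up row with a comma inside a quoted field.
    print(csv_to_tsv('Run,ScientificName\nSRR_A,"virus, unclassified"\n'), end="")
```

Empty fields come out as consecutive tabs, so the column count stays the same on every row.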

taltman commented 3 years ago

Yeah, I'm with you @rcedgar. I'm a tabby myself.

Which SRA accessions, all of the ones in the current version of the master file?

rcedgar commented 3 years ago

Using the current master file makes sense. If we add new assemblies for some reason you can use the same script to add new entries incrementally.

rchikhi commented 3 years ago

I initially had a TSV for the master table, but then I also had empty fields which messed up column display..

taltman commented 3 years ago

Here's what I have so far: s3://serratus-taltman/scratch/sra_disaster_table.tsv

Also, attaching for convenience.

Columns: SRA accession, BioProject, Platform, Instrument Model, sample taxonID, sample scientific name

Note: Entrez queries are SRX-oriented, not SRA-oriented, so effectively my queries pulled down all SRA metadata for all Experiments where one or more of the experiment's SRA runs were part of the query. So this file has more rows than the sra_master_table.csv file.
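If only the originally queried runs are wanted, the extra rows can be dropped afterwards. The following is a minimal sketch under the assumption that each row is a list of fields with the run accession in the first column; `keep_queried_runs` is a hypothetical helper, not part of the script discussed here.

```python
def keep_queried_runs(rows, queried_runs):
    """Return (kept_rows, missing_runs) for a table keyed on its first column."""
    wanted = set(queried_runs)
    # Keep only rows whose first field is one of the queried run accessions.
    kept = [row for row in rows if row and row[0] in wanted]
    found = {row[0] for row in kept}
    # Queried runs that never appeared in the table, sorted for stable output.
    missing = sorted(wanted - found)
    return kept, missing


if __name__ == "__main__":
    # Made-up rows: SRR_B shares an experiment with SRR_A but was not queried.
    table = [["SRR_A", "PRJ_X", "ILLUMINA"], ["SRR_B", "PRJ_X", "ILLUMINA"]]
    print(keep_queried_runs(table, ["SRR_A", "SRR_C"]))
```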

Also, some SRA metadata caused errors. These rows are flagged by the second column having a value of "Error processing SRA run". I'll investigate, time permitting.
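Anyone consuming the file may want to set the flagged rows aside first. This is a short sketch that assumes rows are lists of fields and that the flag text matches exactly; `split_errors` is a hypothetical helper.

```python
ERROR_FLAG = "Error processing SRA run"


def split_errors(rows):
    """Return (ok_rows, failed_accessions) based on the second column."""
    ok, failed = [], []
    for row in rows:
        if len(row) > 1 and row[1] == ERROR_FLAG:
            # Remember only the accession so it can be retried later.
            failed.append(row[0])
        else:
            ok.append(row)
    return ok, failed


if __name__ == "__main__":
    # Made-up rows, for illustration only.
    rows = [["SRR_A", "PRJ_X", "ILLUMINA"], ["SRR_B", ERROR_FLAG]]
    print(split_errors(rows))
```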

sra_disaster_table.xlsx

Please let me know if there are other attributes that might be desired.

taltman commented 3 years ago

As you can tell, the sample taxon and scientific name fields are a mess. A semantic mix of "where the sample was obtained" and "what we find inside the sample".

rcedgar commented 3 years ago

@taltman SRA table looks great, thanks! You make a good point about the semantic mix; I hadn't given that enough consideration. Getting a good set of assembly-host associations is going to be a fair amount of work. I'll open a new issue proposing a strategy. Your table solves the Ion Torrent filtering, so I'll close this issue.
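For illustration, the filtering step could look something like the sketch below. The OTU table layout is not described in this thread, so the `(otu_id, run)` pairs, the platform string `ION_TORRENT`, and the function name `drop_ion_torrent` are all assumptions.

```python
def drop_ion_torrent(otu_rows, platform_by_run, excluded=("ION_TORRENT",)):
    """Split (otu_id, run) pairs into (kept, dropped) by sequencing platform.

    Runs with an unknown platform are kept, since "?" is not evidence of
    Ion Torrent; pass more names in `excluded` to filter other platforms.
    """
    kept, dropped = [], []
    for otu_id, run in otu_rows:
        platform = platform_by_run.get(run, "?")
        if platform in excluded:
            dropped.append((otu_id, run))
        else:
            kept.append((otu_id, run))
    return kept, dropped


if __name__ == "__main__":
    # Made-up OTU IDs and run names, for illustration only.
    otus = [("otu1", "SRR_A"), ("otu2", "SRR_B")]
    platforms = {"SRR_A": "ION_TORRENT", "SRR_B": "ILLUMINA"}
    print(drop_ion_torrent(otus, platforms))
```

Returning the dropped pairs alongside the kept ones makes it easy to review what the filter removed.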

taltman commented 3 years ago

Actually, let's keep this open until we have worked out a pipeline for integrating this info into the 'master' table.

rcedgar commented 3 years ago

Conceptually, "data about the SRA according to the SRA" and "data about one assembly derived from the assembly" are not the same. Therefore IMO it's fine, actually better, to keep these as separate tables. I suggest we declare victory and close the issue.

taltman commented 3 years ago

@rchikhi Here is where you can find the sra_metadata.py script. Look at the docstring at the top for an example of how to use with the sra_master_table.csv file.

https://github.com/ababaian/serratus/tree/taltman-dev/src/summarizer

I submitted a pull request, so hopefully it will not be in an obscure branch in the future.

Reassigning to you. Feel free to close the issue if you think this is good enough, or keep it open if you want to integrate the script into the sra_master_table.csv generation code.

rchikhi commented 3 years ago

Thanks for taking this off my plate @taltman! very useful script.

Regarding the bikeshedding debate about putting this info in the master table or not: well, I already integrated `sra_disaster_table.tsv` yesterday, before I saw this debate, so I probably won't undo the work.