rcedgar closed this issue 3 years ago
IonTorrent might have elevated # of indels => frameshifts...
I'm automatically parsing the SRA runinfo CSV's that Artem gave me, but many accessions were from the STAT list, hence not Serratus, thus we don't have metadata for them
@rchikhi Can you fix this by downloading the SRA runinfo files that are missing? IMHO the master table should be fixed rather than me doing a post-processing hack...
I can try tomorrow but I don't know an immediate way to download runinfo from a massive list of arbitrary accessions
You can do it in Batch Entrez (https://www.ncbi.nlm.nih.gov/sites/batchentrez) as follows: make a text file with the accessions, browse to the file, and click Retrieve. This will send you to the SRA web page with the answer to the query. Select Send To, format RunInfo. There is surely an equivalent way to do this from the command-line using the NCBI command-line utilities epost, efetch, something roughly like this (exact command line arguments left as exercise for the reader):
cat SRAs.txt | epost -db sra | efetch -format runinfo
excellent, thanks!
Unfortunately you can't really use EDirect from the command line for extracting SRA data using standard unixy tools, as it is all returned as XML. You'll need to use some command-line XML parser. What I recommend is to use BioPython.
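For reference, here is a minimal sketch of pulling run accessions and platform out of SRA XML. It uses the standard library's xml.etree.ElementTree rather than BioPython, purely to keep the example self-contained. The sample document is hand-written and trimmed; the element names (EXPERIMENT_PACKAGE, PLATFORM, INSTRUMENT_MODEL, RUN) are assumptions about the layout of the SRA XML and should be checked against real efetch output. The accessions are made up.

```python
import xml.etree.ElementTree as ET

# Hand-written, heavily trimmed sample; accessions are hypothetical and the
# element layout is an assumption about what the SRA XML looks like.
SAMPLE_XML = """<EXPERIMENT_PACKAGE_SET>
  <EXPERIMENT_PACKAGE>
    <EXPERIMENT accession="SRX0000001">
      <PLATFORM>
        <ION_TORRENT>
          <INSTRUMENT_MODEL>Ion Torrent PGM</INSTRUMENT_MODEL>
        </ION_TORRENT>
      </PLATFORM>
    </EXPERIMENT>
    <RUN_SET>
      <RUN accession="SRR0000001"/>
      <RUN accession="SRR0000002"/>
    </RUN_SET>
  </EXPERIMENT_PACKAGE>
</EXPERIMENT_PACKAGE_SET>"""


def runs_with_platform(xml_text):
    """Return (run accession, platform, instrument model) for every run."""
    root = ET.fromstring(xml_text)
    records = []
    for package in root.iter("EXPERIMENT_PACKAGE"):
        platform, model = "", ""
        platform_el = package.find("EXPERIMENT/PLATFORM")
        if platform_el is not None and len(platform_el) > 0:
            # The platform name is the tag of the first child of PLATFORM.
            first = platform_el[0]
            platform = first.tag
            model = (first.findtext("INSTRUMENT_MODEL") or "").strip()
        # Every run in the package inherits the experiment's platform.
        for run in package.iter("RUN"):
            records.append((run.get("accession", ""), platform, model))
    return records
```

Each record is a plain tuple, so the result can be written straight out as tab-separated lines.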
Actually, there might be some overlap between this, the queries that we might want to do for co-assembly, and my need for pulling out host information. Kill three birds with one code?
Aside from 'platform', are there any other attributes that you need @rcedgar ?
@rcedgar, also help me understand which accessions are of interest; just the ones with platform '?' ?
@rchikhi I'll generate the report, and you'll do the patching into the master-of-the-universe file. Tell me which columns I'll need to extract beyond the SRA accession to allow you to join the tables. Those birds won't know what hit 'em.
Ah! great, thanks Tomer. I only need the 'Run' and 'Platform' fields from the sra metadata..
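If the runinfo is already available as CSV (for example from the Batch Entrez download), extracting just those two fields could look roughly like the sketch below. It assumes the header contains columns named Run and Platform; the sample rows and accessions are invented for illustration. Python's csv module handles quoted commas inside fields, and the output is tab-separated.

```python
import csv
import io

# Invented sample; only the column names Run and Platform are assumed real.
SAMPLE_RUNINFO = (
    "Run,spots,Platform,Model,SampleName\n"
    'SRR0000001,100,ILLUMINA,Illumina HiSeq 2500,"gut, human"\n'
    "SRR0000002,200,ION_TORRENT,Ion Torrent PGM,sample2\n"
)


def runinfo_to_tsv(runinfo_csv_text, fields=("Run", "Platform")):
    """Return TSV text holding only the requested runinfo columns."""
    reader = csv.DictReader(io.StringIO(runinfo_csv_text))
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(fields)
    for row in reader:
        run = (row.get("Run") or "").strip()
        # Concatenated runinfo files repeat the header line; skip those rows
        # along with any row that has no run accession.
        if not run or run == "Run":
            continue
        writer.writerow([(row.get(name) or "").strip() for name in fields])
    return out.getvalue()
```

Passing a longer tuple for fields (for example adding Model) keeps additional columns without changing the function.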
How about this:
@tomer make a tsv file from the runinfo's for all the assemblies. (IMO csv is a totally brain-damaged format because commas often appear in text while tabs rarely do, see csvformat -h).
@rchikhi remove runinfo fields from the master table, stick to what you get from the assemblies. Maybe switch to tsv format :-)
Keep two separate tsvs, this is fine, no need to join everything IMO, that can be tricky.
Yeah, I'm with you @rcedgar. I'm a tabby myself.
Which SRA accessions, all of the ones in the current version of the master file?
Using the current master file makes sense. If we add new assemblies for some reason you can use the same script to add new entries incrementally.
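The incremental part could be as simple as working out which accessions in the master file are not yet in the SRA table, and querying only those. A rough sketch, assuming the SRA table is a TSV with a header whose accession column is called Run (a hypothetical name here; adjust to the real header):

```python
import csv
import io


def missing_accessions(master_accessions, sra_table_tsv_text, key="Run"):
    """Return master accessions absent from the SRA table, in input order."""
    reader = csv.DictReader(io.StringIO(sra_table_tsv_text), delimiter="\t")
    already_fetched = {row[key] for row in reader if row.get(key)}
    missing = []
    seen = set()
    for accession in master_accessions:
        # Report each missing accession once, even if the master lists it twice.
        if accession not in already_fetched and accession not in seen:
            seen.add(accession)
            missing.append(accession)
    return missing
```

The returned list is what would be written to the accession file for the next metadata query.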
I initially had a TSV for the master table, but then I also had empty fields which messed up column display..
Here's what I have so far: s3://serratus-taltman/scratch/sra_disaster_table.tsv
Also, attaching for convenience.
Columns: SRA accession, BioProject, Platform, Instrument Model, sample taxonID, sample scientific name
Note: Entrez queries are SRX-oriented, not SRA-oriented, so effectively my queries pulled down all SRA metadata for all Experiments where one or more of the experiment's SRA runs were part of the query. So this file has more rows than the sra_master_table.csv file.
Also, some SRA metadata caused errors; those rows are flagged by a second-column value of "Error processing SRA run". I'll investigate time-permitting.
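To get back to one row per master-table run, something like the following sketch could separate the three kinds of rows: runs that are in the master table, error rows to retry, and extra runs that only came along because they share an Experiment with a queried run. It assumes each row is a list with the accession in the first column and, for failures, the error marker in the second, as described above. The accessions in any example input are hypothetical.

```python
ERROR_MARKER = "Error processing SRA run"


def split_rows(rows, wanted_runs):
    """Split SRA table rows into (kept rows, error accessions, extra accessions)."""
    wanted = set(wanted_runs)
    kept, errors, extra = [], [], []
    for row in rows:
        if not row:
            continue  # ignore blank lines
        accession = row[0]
        if len(row) > 1 and row[1] == ERROR_MARKER:
            errors.append(accession)   # metadata failed to parse; retry later
        elif accession in wanted:
            kept.append(row)           # run is in the master table
        else:
            extra.append(accession)    # pulled in via a shared Experiment
    return kept, errors, extra
```

Rows can be produced with csv.reader(handle, delimiter="\t"), skipping the header line if the file has one.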
Please let me know if there are other attributes that might be desired.
As you can tell, the sample taxon and scientific name fields are a mess. A semantic mix of "where the sample was obtained" and "what we find inside the sample".
@taltman SRA table looks great, thanks! You make a good point about the semantic mix, I hadn't given that enough consideration. Getting a good set of assembly-host associations is going to be a fair amount of work. I'll open a new issue proposing a strategy. Your table solves the I-T filtering so I'll close this issue.
Actually, let's keep this open until we have worked out a pipeline for integrating this info into the 'master' table.
Conceptually, "data about the SRA according to the SRA" and "data about one assembly derived from the assembly" are not the same. Therefore IMO fine, actually better, to keep these as separate tables -- suggest declare victory and close issue.
@rchikhi Here is where you can find the sra_metadata.py script. Look at the docstring at the top for an example of how to use with the sra_master_table.csv file.
https://github.com/ababaian/serratus/tree/taltman-dev/src/summarizer
I submitted a pull request, so hopefully it will not be in an obscure branch in the future.
Reassigning to you. Feel free to close the issue if you think this is good enough, or you want to integrate into the sra_master_table.csv generation code.
Thanks for taking this off my plate @taltman! very useful script.
Regarding the bikeshedding debate of putting this info in the master table or not, well, I've already integrated the `sra_disaster_table.tsv` yesterday before I saw this debate, so probably won't undo the work.
SRR11840026 is an Ion Torrent dataset which gave a novel OTU which is most likely a false positive due to Ion sequencing.
The master assembly table has "?" for the SRR11840026 "platform" field, same for more than half the assemblies. Can this be fixed to give the correct platform? Do I need to go back to the SRA runinfo's ... hope not!
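One way the requested fix could work, once a run-to-platform lookup exists, is sketched below: fill in only the "?" entries and then list the Ion Torrent runs for filtering. The column names accession and platform are hypothetical stand-ins for whatever the master table actually uses, and the platform string ION_TORRENT is assumed to be how the SRA metadata spells it.

```python
def patch_platform(master_rows, platform_by_run, unknown="?"):
    """Return copies of the master rows with unknown platforms filled in."""
    patched = []
    for row in master_rows:
        row = dict(row)  # copy so the caller's rows are left untouched
        if row.get("platform") == unknown:
            # Leave the "?" in place when the lookup has no entry for the run.
            row["platform"] = platform_by_run.get(row["accession"], unknown)
        patched.append(row)
    return patched


def ion_torrent_runs(rows, platform_name="ION_TORRENT"):
    """Return accessions whose platform matches the Ion Torrent label."""
    return [row["accession"] for row in rows if row.get("platform") == platform_name]
```

For example, with a lookup of {"SRR11840026": "ION_TORRENT"}, a master row for SRR11840026 whose platform is "?" would be patched and then reported by ion_torrent_runs.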