Closed steph-sing closed 1 year ago
Was not complete before V1.41 data push for Dec 2022.
@rajamazumder and @steph-sing
Below is a table of the values for ngsQC, with a Y or N for whether I am able to get each one from NCBI using my previous programs. They have changed the way the site is accessed, and it is no longer possible to get all of the analysis data that was available before.
These two files are what I am able to get:
Field | Available from NCBI (Y/N) |
---|---|
organism_name | Y |
infraspecific_name | N |
lineage | Y |
genome_assembly_id | N |
taxonomy_id | Y |
bco_id | Y |
schema_version | Y |
analysis_platform | Y |
analysis_platform_object_id | Y |
bioproject | Y |
biosample | Y |
strain | Y |
sra_run_id | Y |
ngs_read_file_name | Y |
ngs_read_file_source | Y |
ngs_gc_content | N |
avg_phred_score | N |
avg_read_length | Y |
max_read_length | N |
min_read_length | N |
num_reads_unique | N |
pos_outlier_count | N |
codon_table | N |
percent_coding | N |
percent_not_coding | N |
density_n_per_read | N |
complexity_percent | N |
non_complex_percent | N |
avg_quality_a | N |
avg_quality_t | N |
avg_quality_g | N |
avg_quality_c | N |
count_a | N |
count_t | N |
count_g | N |
count_c | N |
instrument | N |
id_method | N |
wgs_accession | N |
strategy | N |
ngs_score | N |
@HadleyKing @rajamazumder I don't see any issue or blocker with this table. A couple of terms in the Data dictionary may need to be added or changed for the items outlined in the third list, but I don't see these as blockers. Explanations below:
The following would not be pulled from NCBI regardless; these are manual inputs or are generated in HIVE.

Manual:

Field | Y/N |
---|---|
bco_id | Y |
schema_version | Y |
analysis_platform | Y |
analysis_platform_object_id | Y |
ngs_read_file_source | Y |
ngs_score | N |

HIVE:

Field | Y/N |
---|---|
ngs_gc_content | N |
avg_phred_score | N |
num_reads_unique | N |
pos_outlier_count | N |
codon_table | N |
percent_coding | N |
percent_not_coding | N |
density_n_per_read | N |
complexity_percent | N |
non_complex_percent | N |
avg_quality_a | N |
avg_quality_t | N |
avg_quality_g | N |
avg_quality_c | N |
count_a | N |
count_t | N |
count_g | N |
count_c | N |
max_read_length | N |
min_read_length | N |
Items listed here can easily be derived from NCBI, and the attached file has additional terms and values you could consider adding to your table, using your example Org:

Example Values: SraRunInfo.csv

Of Note: Aside from genome_assembly_id and infraspecific_name, the remaining N values are all outlined in the text file you provided:
Field | Y/N |
---|---|
organism_name | Y |
infraspecific_name | N |
lineage | Y |
genome_assembly_id | N |
taxonomy_id | Y |
bioproject | Y |
biosample | Y |
strain | Y |
sra_run_id | Y |
ngs_read_file_name | Y |
avg_read_length | Y |
instrument | N |
id_method | N |
wgs_accession | N |
strategy | N |
You may need to dig a bit deeper to get the following info from NCBI, but it exists:

- ngs_gc_content
- genome_assembly_id
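To keep track of what is still outstanding, the three groupings above can be written down as data. This is only a sketch: the field names and Y/N flags are copied from this comment, while the dictionary names and the `missing_fields` helper are hypothetical and not part of any ARGOS code.

```python
# Field groupings copied from the lists in this comment.
# "Y"/"N" is the availability flag from the original table.
MANUAL = {
    "bco_id": "Y", "schema_version": "Y", "analysis_platform": "Y",
    "analysis_platform_object_id": "Y", "ngs_read_file_source": "Y",
    "ngs_score": "N",
}

HIVE = {
    "ngs_gc_content": "N", "avg_phred_score": "N", "num_reads_unique": "N",
    "pos_outlier_count": "N", "codon_table": "N", "percent_coding": "N",
    "percent_not_coding": "N", "density_n_per_read": "N",
    "complexity_percent": "N", "non_complex_percent": "N",
    "avg_quality_a": "N", "avg_quality_t": "N", "avg_quality_g": "N",
    "avg_quality_c": "N", "count_a": "N", "count_t": "N", "count_g": "N",
    "count_c": "N", "max_read_length": "N", "min_read_length": "N",
}

NCBI = {
    "organism_name": "Y", "infraspecific_name": "N", "lineage": "Y",
    "genome_assembly_id": "N", "taxonomy_id": "Y", "bioproject": "Y",
    "biosample": "Y", "strain": "Y", "sra_run_id": "Y",
    "ngs_read_file_name": "Y", "avg_read_length": "Y", "instrument": "N",
    "id_method": "N", "wgs_accession": "N", "strategy": "N",
}


def missing_fields(group):
    """Return the field names in a group that are still flagged N, in table order."""
    return [name for name, flag in group.items() if flag == "N"]
```

Calling `missing_fields(NCBI)` lists the NCBI-derived fields that still need a source, which is the part of the table that actually needs follow-up.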
Orgs for NCBI/EBI Email:
organism_name: Salmonella enterica LT2
genome_assembly_id: GCA_001558355.2
taxonomy_id: 28901
biosample: SAMN03996249
sra_run_id: SRR2814419

organism_name: Severe acute respiratory syndrome coronavirus 2 Wuhan-Hu-1
genome_assembly_id: GCA_009858895.3
taxonomy_id: 2697049
biosample: SAMN13922059
sra_run_id: SRR10971381

organism_name: Influenza A virus A/Puerto Rico/8/1934(H1N1)
genome_assembly_id: GCA_000865725.1
taxonomy_id: 211044
biosample: SAMEA51847918
sra_run_id: ERR2096902
Notes:

Download metadata associated with SRA data (from the search result page)

SRA Run files do not contain any information about the metadata (sample information, etc.) linked to the data themselves. To download metadata for each Run in your Entrez query, click "Send to" at the top of the page, check the "File" radio button, and select "RunInfo" in the pull-down menu. This will generate a tabular SraRunInfo.csv file with metadata available for each Run.
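Once a SraRunInfo.csv file has been downloaded this way, a short script could pull out the columns that line up with the ngsQC fields. The sketch below is illustrative only: the column headers on the left of the mapping (`Run`, `ScientificName`, `TaxID`, and so on) and the pairing of `Model` with instrument and `LibraryStrategy` with strategy are assumptions about the RunInfo export, so check them against the header row of your own file. The function name is hypothetical.

```python
import csv

# ASSUMED mapping from SraRunInfo.csv column headers to ngsQC field names.
# Verify the left-hand names against the header of the downloaded file.
RUNINFO_TO_NGSQC = {
    "Run": "sra_run_id",
    "ScientificName": "organism_name",
    "TaxID": "taxonomy_id",
    "BioProject": "bioproject",
    "BioSample": "biosample",
    "avgLength": "avg_read_length",
    "Model": "instrument",
    "LibraryStrategy": "strategy",
}


def runinfo_to_ngsqc(lines):
    """Convert RunInfo CSV lines (an open file or list of strings) to ngsQC-style dicts."""
    records = []
    for row in csv.DictReader(lines):
        if not row.get("Run"):
            continue  # skip rows without a run accession
        # Columns missing from the file come back as empty strings.
        records.append(
            {field: (row.get(column) or "") for column, field in RUNINFO_TO_NGSQC.items()}
        )
    return records
```

For a real file this would be called as `runinfo_to_ngsqc(open("SraRunInfo.csv", newline=""))`. All values are returned as strings, so numeric fields such as avg_read_length would still need converting before loading.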
@HadleyKing I'm not able to download this file either. Can you please fix this issue? Otherwise, could you put both the ngs and assembly datasets into this folder on the Dev server? Tag me and I can review them that way. Thanks
/data/shared/argosdb/downloads/review
@HadleyKing Received. Will review shortly and get back to you.
v1.1
Completed as part of 1.42 data release.
You have already done this through this sheet (https://data.argosdb.org/ARGOS_000009). What you can do is update the BCO language (title, usability domain, etc.), file name, and IO Domain, and update the file as necessary to match the Data dictionary (though I believe it already conforms to the non-core data dictionary). If you want to update it per the core data dictionary, let me know and we can plan those updates.