FDA-ARGOS / data.argosdb

Mazumder - Generate ngsQC_ncbi table from FDA BioProject SRA data #144

Closed steph-sing closed 1 year ago

steph-sing commented 1 year ago

You have already done this through this sheet: https://data.argosdb.org/ARGOS_000009 - what you can do is update the BCO language (title, usability domain, etc.), the file name, and the IO Domain, and update the file as necessary to match the Data dictionary (though I believe it already conforms to the non-core data dictionary). If you want to update it per the core data dictionary, let me know and we can plan those updates.

steph-sing commented 1 year ago

Was not complete before V1.41 data push for Dec 2022.

HadleyKing commented 1 year ago

@rajamazumder and @steph-sing

Below is a table of the ngsQC fields, each marked Y or N for whether I am able to get it from NCBI using my previous programs. They have changed the way the site is accessed, and it is no longer possible to get all of the analysis data they had before.

These two files are what I am able to get:

organism_name Y
infraspecific_name N
lineage Y
genome_assembly_id N
taxonomy_id Y
bco_id Y
schema_version Y
analysis_platform Y
analysis_platform_object_id Y
bioproject Y
biosample Y
strain Y
sra_run_id Y
ngs_read_file_name Y
ngs_read_file_source Y
ngs_gc_content N
avg_phred_score N
avg_read_length Y
max_read_length N
min_read_length N
num_reads_unique N
pos_outlier_count N
codon_table N
percent_coding N
percent_not_coding N
density_n_per_read N
complexity_percent N
non_complex_percent N
avg_quality_a N
avg_quality_t N
avg_quality_g N
avg_quality_c N
count_a N
count_t N
count_g N
count_c N
instrument N
id_method N
wgs_accession N
strategy N
ngs_score N

HadleyKing commented 1 year ago

See NCBI Trace database to be retired in June 2022

and

The wait is over… NIH’s Public Sequence Read Archive is now open access on the cloud

steph-sing commented 1 year ago

@HadleyKing @rajamazumder I don't see any issue or blocker with this table. A couple of terms in the Data dictionary may need to be added or changed for the items outlined in the third list, but I don't see these as blockers. Explanations below:

The following would not be pulled from NCBI regardless. These are manual inputs or are generated in HIVE:

Manual:
bco_id | Y
schema_version | Y
analysis_platform | Y
analysis_platform_object_id | Y
ngs_read_file_source | Y
ngs_score | N

HIVE:
ngs_gc_content | N
avg_phred_score | N
num_reads_unique | N
pos_outlier_count | N
codon_table | N
percent_coding | N
percent_not_coding | N
density_n_per_read | N
complexity_percent | N
non_complex_percent | N
avg_quality_a | N
avg_quality_t | N
avg_quality_g | N
avg_quality_c | N
count_a | N
count_t | N
count_g | N
count_c | N
max_read_length | N
min_read_length | N

The items listed here can easily be derived from NCBI. The attached file, generated for your example organism, has additional terms and values you could consider adding to your table.

Example values: SraRunInfo.csv

Of note: aside from genome_assembly_id and infraspecific_name, the remaining N values are all outlined in the text file you provided:

organism_name | Y
infraspecific_name | N
lineage | Y
genome_assembly_id | N
taxonomy_id | Y
bioproject | Y
biosample | Y
strain | Y
sra_run_id | Y
ngs_read_file_name | Y
avg_read_length | Y
instrument | N
id_method | N
wgs_accession | N
strategy | N

You may need to dig a bit deeper to get the following info from NCBI, but it exists:

ngs_gc_content
genome_assembly_id
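The three groupings above can be written down as a small lookup, which makes it easy to list the fields still marked N that are expected to come from NCBI. This is only a sketch: the names `MANUAL`, `HIVE`, `NCBI`, `AVAILABLE`, and `pending_from_ncbi` are hypothetical, and the Y/N flags are copied from the table earlier in this thread.

```python
# Field provenance as grouped in this thread (variable names are illustrative).
MANUAL = [
    "bco_id", "schema_version", "analysis_platform",
    "analysis_platform_object_id", "ngs_read_file_source", "ngs_score",
]
HIVE = [
    "ngs_gc_content", "avg_phred_score", "num_reads_unique", "pos_outlier_count",
    "codon_table", "percent_coding", "percent_not_coding", "density_n_per_read",
    "complexity_percent", "non_complex_percent",
    "avg_quality_a", "avg_quality_t", "avg_quality_g", "avg_quality_c",
    "count_a", "count_t", "count_g", "count_c",
    "max_read_length", "min_read_length",
]
NCBI = [
    "organism_name", "infraspecific_name", "lineage", "genome_assembly_id",
    "taxonomy_id", "bioproject", "biosample", "strain", "sra_run_id",
    "ngs_read_file_name", "avg_read_length", "instrument", "id_method",
    "wgs_accession", "strategy",
]

# Fields flagged Y in the table posted earlier in the thread.
AVAILABLE = {
    "organism_name", "lineage", "taxonomy_id", "bco_id", "schema_version",
    "analysis_platform", "analysis_platform_object_id", "bioproject",
    "biosample", "strain", "sra_run_id", "ngs_read_file_name",
    "ngs_read_file_source", "avg_read_length",
}


def pending_from_ncbi():
    """Return NCBI-sourced fields that are still flagged N, in table order."""
    return [field for field in NCBI if field not in AVAILABLE]


if __name__ == "__main__":
    for field in pending_from_ncbi():
        print(field)
```

Keeping provenance separate from availability means a field can move from N to Y without touching the grouping.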

steph-sing commented 1 year ago

Orgs for NCBI/EBI Email:

organism_name: Salmonella enterica LT2
genome_assembly_id: GCA_001558355.2
taxonomy_id: 28901
biosample: SAMN03996249
sra_run_id: SRR2814419

organism_name: Severe acute respiratory syndrome coronavirus 2 Wuhan-Hu-1
genome_assembly_id: GCA_009858895.3
taxonomy_id: 2697049
biosample: SAMN13922059
sra_run_id: SRR10971381

organism_name: Influenza A virus A/Puerto Rico/8/1934(H1N1)
genome_assembly_id: GCA_000865725.1
taxonomy_id: 211044
biosample: SAMEA51847918
sra_run_id: ERR2096902
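Before sending identifiers like these out, a quick format check can catch a value pasted into the wrong field. The sketch below is an assumption-heavy illustration: the regular expressions reflect commonly seen accession shapes (GCA_/GCF_ assemblies, SAMN/SAMEA/SAMD BioSamples, SRR/ERR/DRR runs) and are not an official specification, and `PATTERNS` and `invalid_fields` are hypothetical names.

```python
import re

# Assumed accession shapes; verify against current NCBI/EBI/DDBJ documentation.
PATTERNS = {
    "genome_assembly_id": re.compile(r"GC[AF]_\d{9}\.\d+"),
    "taxonomy_id": re.compile(r"\d+"),
    "biosample": re.compile(r"SAM(?:N|EA|D)\d+"),
    "sra_run_id": re.compile(r"[SED]RR\d+"),
}


def invalid_fields(record):
    """Return names of fields present in record whose value fails its pattern."""
    return [
        name
        for name, pattern in PATTERNS.items()
        if name in record and not pattern.fullmatch(record[name])
    ]


# The three organisms listed in this comment.
RECORDS = [
    {"genome_assembly_id": "GCA_001558355.2", "taxonomy_id": "28901",
     "biosample": "SAMN03996249", "sra_run_id": "SRR2814419"},
    {"genome_assembly_id": "GCA_009858895.3", "taxonomy_id": "2697049",
     "biosample": "SAMN13922059", "sra_run_id": "SRR10971381"},
    {"genome_assembly_id": "GCA_000865725.1", "taxonomy_id": "211044",
     "biosample": "SAMEA51847918", "sra_run_id": "ERR2096902"},
]

if __name__ == "__main__":
    for record in RECORDS:
        print(record["sra_run_id"], invalid_fields(record))
```

Using `fullmatch` rather than `search` means a field holding two identifiers separated by a space is reported as invalid.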

steph-sing commented 1 year ago

Notes: Download metadata associated with SRA data

From the search result page

SRA Run files do not contain any information about the metadata (sample information, etc.) linked to the data themselves. To download metadata for each Run in your Entrez query, click Send to at the top of the page, check the File radio button, and select RunInfo in the pull-down menu. This will generate a tabular SraRunInfo.csv file with metadata available for each Run.
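Once SraRunInfo.csv has been downloaded this way, its columns could be renamed to the ngsQC field names. The sketch below is a guess at such a mapping, not the mapping used for the ARGOS tables: the column headers (`Run`, `avgLength`, `LibraryStrategy`, `Model`, `BioProject`, `BioSample`, `TaxID`, `ScientificName`) are assumed from typical RunInfo exports, the pairing with ngsQC fields is hypothetical, and the sample row holds placeholder values only.

```python
import csv
import io

# Assumed pairing of RunInfo column headers with ngsQC field names.
RUNINFO_TO_NGSQC = {
    "Run": "sra_run_id",
    "avgLength": "avg_read_length",
    "LibraryStrategy": "strategy",
    "Model": "instrument",
    "BioProject": "bioproject",
    "BioSample": "biosample",
    "TaxID": "taxonomy_id",
    "ScientificName": "organism_name",
}


def runinfo_to_ngsqc(handle):
    """Read RunInfo CSV rows and keep only the mapped columns, renamed."""
    rows = []
    for row in csv.DictReader(handle):
        # Columns absent from the file come back as empty strings.
        rows.append(
            {field: row.get(column) or "" for column, field in RUNINFO_TO_NGSQC.items()}
        )
    return rows


# Placeholder content standing in for a downloaded file (not real metadata).
SAMPLE = (
    "Run,avgLength,LibraryStrategy,Model,BioProject,BioSample,TaxID,ScientificName\n"
    "SRR0000001,150,WGS,ExampleModel,PRJNA000000,SAMN00000000,0,Example organism\n"
)

if __name__ == "__main__":
    for record in runinfo_to_ngsqc(io.StringIO(SAMPLE)):
        print(record)
```

For a real file, pass `open("SraRunInfo.csv", newline="")` instead of the `io.StringIO` stand-in.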

HadleyKing commented 1 year ago

ngsQC_NCBI.tsv

steph-sing commented 1 year ago

@HadleyKing I'm not able to download this file either. Can you please fix this issue? Otherwise, could you put both the ngs and assembly datasets into the folder below on the Dev server? Tag me and I can review them that way. Thanks

/data/shared/argosdb/downloads/review

steph-sing commented 1 year ago

@HadleyKing Received. Will review shortly and get back to you.

HadleyKing commented 1 year ago

updated https://biocomputeobject.org/builder/https/biocomputeobject.org/ARGOS_000009/DRAFT

steph-sing commented 1 year ago

Completed as part of 1.42 data release.