USF-HII / snptk

USF HII SNP Toolkit - Analyze and translate SNP entries using NCBI dbSNP and related databases
GNU General Public License v3.0

Parse GRCh38 json files into flat file for digestion into snptk #6

Closed j2moreno closed 4 years ago

j2moreno commented 4 years ago

https://ftp.ncbi.nih.gov/snp/latest_release/JSON/

j2moreno commented 4 years ago

dbSNP JSON build 153 appears to contain multiple nucleotide variations (MNVs). When parsing these SNPs, no position is available for them. For now I ignore them and leave them out when transforming the dbSNP JSON into a flat file.

2020-03-16 14:17:00 svc-3024-5-10.rc.usf.edu DEBUG(1): rs1554922975 on chr11 file 31 does not have a position available
2020-03-16 14:06:25 svc-3024-5-10.rc.usf.edu DEBUG(1): rs1554747070 on chr10 file 31 does not have a position available
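
A minimal sketch of the skip logic described above, not the actual code in `snptk-parse-dbsnp-json.py`. It assumes the position is read from the SPDI entry of the placement flagged `is_ptlp`, which is the layout of the example record later in this thread. The function name `primary_position` is hypothetical.

```python
def primary_position(record):
    """Return the SPDI position of the placement flagged is_ptlp, or None.

    Assumption: records without a usable position (such as the MNVs logged
    above) simply lack a matching placement or SPDI entry.
    """
    snapshot = record.get("primary_snapshot_data", {})
    for placement in snapshot.get("placements_with_allele", []):
        if not placement.get("is_ptlp"):
            continue
        for allele in placement.get("alleles", []):
            spdi = allele.get("allele", {}).get("spdi")
            if spdi is not None and "position" in spdi:
                return spdi["position"]
    return None


def has_position(record):
    """True when the record can be written to the flat file."""
    return primary_position(record) is not None
```

A caller would log and skip any record for which `has_position` returns `False`, which matches the DEBUG lines above.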
j2moreno commented 4 years ago

Fields needed so that snptk can properly read the file:

JSON parsing fields:

j2moreno commented 4 years ago

Because of its size, each bz2 file from https://ftp.ncbi.nih.gov/snp/latest_release/JSON/ was split into 32 parts with snptk-split before being processed with snptk-parse-dbsnp-json.py.
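
For illustration only, here is one way such a split could work. This is not how `snptk-split` is implemented. It assumes each line of the bz2 file is one JSON record and deals lines round-robin into the part files. The function name `split_bz2_lines` and the `part-NN.json` naming are hypothetical.

```python
import bz2
from pathlib import Path


def split_bz2_lines(src, out_dir, parts=32):
    """Distribute the lines of a bz2-compressed file across `parts` plain-text files.

    Lines are assigned round-robin, so each record stays whole and the parts
    end up roughly the same size. Returns the list of part paths.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = [out_dir / f"part-{i:02d}.json" for i in range(parts)]
    handles = [p.open("wt", encoding="utf-8") for p in paths]
    try:
        with bz2.open(src, "rt", encoding="utf-8") as reader:
            for n, line in enumerate(reader):
                handles[n % parts].write(line)
    finally:
        for handle in handles:
            handle.close()
    return paths
```

Each part can then be parsed independently, for example one job per part on a cluster.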

j2moreno commented 4 years ago

https://github.com/USF-HII/snptk/commit/044b6ee89d496587b15f2230bceef98da40caf0b

j2moreno commented 4 years ago

I will also extract orientation information so the output is consistent with the dbSNP build 151 file.

primary_snapshot_data': {'allele_annotations': [{'assembly_annotation': [{'annotation_release': 'Homo '
                                                                                                  'sapiens '
                                                                                                  'Annotation '
                                                                                                  'Release '
                                                                                                  '109',
                                                                            'genes': [{'id': 729296,
                                                                                       'is_pseudo': False,
                                                                                       'locus': 'LOC729296',
                                                                                       'name': 'uncharacterized '
                                                                                               'LOC729296',
                                                                                       'orientation': 'plus',
j2moreno commented 4 years ago

Some SNPs do not have orientation information. In these cases we output '-'. In the record below, for example, every `assembly_annotation` list is empty:

{"refsnp_id":"171425","create_date":"2000-07-12T13:47Z","last_update_date":"2019-07-14T00:08Z","last_update_build_id":"153","dbsnp1_merges":[{"merged_rsid":"171426","revision":"137","merge_date":"2012-05-4T12:46Z"},{"merged_rsid":"57051557","revision":"130","merge_date":"2008-05-23T16:19Z"}],"citations":[],"lost_obs_movements":[],"present_obs_movements":[{"component_ids":[{"type":"subsnp","value":"81368782"}],"observation":{"seq_id":"NC_000017.9","position":41374943,"deleted_sequence":"G","inserted_sequence":"G"},"allele_in_cur_release":{"seq_id":"NC_000017.9","position":41374943,"deleted_sequence":"G","inserted_sequence":"G"},"other_rsids_in_cur_release":[],"previous_release":{"allele":{"seq_id":"NC_000017.9","position":41374943,"deleted_sequence":"G","inserted_sequence":"G"},"rsids":["171425"]},"last_added_to_this_rs":"137"},{"component_ids":[{"type":"subsnp","value":"81368782"}],"observation":{"seq_id":"NC_000017.9","position":41374943,"deleted_sequence":"G","inserted_sequence":"T"},"allele_in_cur_release":{"seq_id":"NC_000017.9","position":41374943,"deleted_sequence":"G","inserted_sequence":"T"},"other_rsids_in_cur_release":[],"previous_release":{"allele":{"seq_id":"NC_000017.9","position":41374943,"deleted_sequence":"G","inserted_sequence":"T"},"rsids":["171425"]},"last_added_to_this_rs":"137"}],"primary_snapshot_data":{"placements_with_allele":[{"seq_id":"NC_000017.9","is_ptlp":true,"placement_annot":{"seq_type":"refseq_chromosome","mol_type":"genomic","seq_id_traits_by_assembly":[],"is_aln_opposite_orientation":false,"is_mismatch":false},"alleles":[{"allele":{"spdi":{"seq_id":"NC_000017.9","position":41374943,"deleted_sequence":"G","inserted_sequence":"G"}},"hgvs":"NC_000017.9:g.41374944="},{"allele":{"spdi":{"seq_id":"NC_000017.9","position":41374943,"deleted_sequence":"G","inserted_sequence":"T"}},"hgvs":"NC_000017.9:g.41374944G>T"}]}],"allele_annotations":[{"frequency":[],"clinical":[],"submissions":["81368782"],"assembly_annotation":[]},{"frequency":[],"clinical":[],"submissions":["81368782"],"assembly_annotation":[]}],"support":[{"id":{"type":"subsnp","value":"ss81368782"},"revision_added":"137","create_date":"2007-12-14T18:29Z","submitter_handle":"HGSV"}],"anchor":"NC_000017.9:0041374943:1:snv","variant_type":"snv"}}
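
The '-' fallback can be sketched as follows. It is a rough illustration, not the code from the linked commit. It assumes the orientation is taken from the first gene found under `allele_annotations` → `assembly_annotation` → `genes`, following the structure shown in the two dumps above. The function name `gene_orientation` is hypothetical, and a record whose genes disagree on orientation would need a rule the thread does not specify.

```python
def gene_orientation(record, missing="-"):
    """Return the first gene orientation found in a refsnp record.

    Falls back to `missing` ('-' by default) when no allele annotation
    carries any gene, as in the record above where every
    assembly_annotation list is empty.
    """
    snapshot = record.get("primary_snapshot_data", {})
    for annotation in snapshot.get("allele_annotations", []):
        for assembly in annotation.get("assembly_annotation", []):
            for gene in assembly.get("genes", []):
                orientation = gene.get("orientation")
                if orientation:
                    return orientation
    return missing
```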
j2moreno commented 4 years ago

https://github.com/USF-HII/snptk/commit/ddc7956ee35b7ff805b7f439d489fae273536222