Closed j2moreno closed 4 years ago
dbSNP JSON build 153 appears to contain Multiple Nucleotide Variations (MNVs). When parsing these SNPs, no position is available for them. For now I ignore them and leave them out when I transform the dbSNP JSON into a flat file.
2020-03-16 14:17:00 svc-3024-5-10.rc.usf.edu DEBUG(1): rs1554922975 on chr11 file 31 does not have a position available
2020-03-16 14:06:25 svc-3024-5-10.rc.usf.edu DEBUG(1): rs1554747070 on chr10 file 31 does not have a position available
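A minimal sketch of the skip-and-log behaviour described above, not the actual snptk parser. It assumes the position is read from the SPDI entry of the placement flagged `is_ptlp` (as in the sample record in this issue), and it does not explain why the MNV records lack one. The names `primary_position` and `parse_line` are hypothetical.

```python
import json
import logging

logger = logging.getLogger("dbsnp_parse")  # hypothetical logger name


def primary_position(record):
    """Return the SPDI position on the is_ptlp placement, or None if absent."""
    snapshot = record.get("primary_snapshot_data", {})
    for placement in snapshot.get("placements_with_allele", []):
        if not placement.get("is_ptlp"):
            continue
        for allele in placement.get("alleles", []):
            spdi = allele.get("allele", {}).get("spdi")
            if spdi and "position" in spdi:
                return spdi["position"]
    return None


def parse_line(line):
    """Parse one JSON record; return (refsnp_id, position) or None to skip it."""
    record = json.loads(line)
    position = primary_position(record)
    if position is None:
        # Mirrors the debug messages above: record is ignored, not written out.
        logger.debug("rs%s does not have a position available",
                     record.get("refsnp_id"))
        return None
    return record["refsnp_id"], position
```

Note that SPDI positions are 0-based: in the sample record, SPDI position 41374943 corresponds to HGVS `g.41374944`.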
Fields needed so that snptk can properly read the file:
Json parsing fields:
Due to size, each bz2 file from https://ftp.ncbi.nih.gov/snp/latest_release/JSON/ was split into 32 parts using snptk-split before being processed with snptk-parse-dbsnp-json.py.
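For illustration only, here is one way a bz2 file could be split into 32 parts. This is not how snptk-split is implemented; it is a round-robin sketch that assumes one JSON record per line (as the sample record in this issue suggests). The function name and the `.partNN.bz2` naming are hypothetical.

```python
import bz2
from contextlib import ExitStack


def split_bz2_lines(src_path, out_prefix, parts=32):
    """Distribute the lines of a bz2 file round-robin across `parts` bz2 files."""
    out_paths = [f"{out_prefix}.part{i:02d}.bz2" for i in range(parts)]
    with ExitStack() as stack:
        src = stack.enter_context(bz2.open(src_path, "rt", encoding="utf-8"))
        outs = [stack.enter_context(bz2.open(p, "wt", encoding="utf-8"))
                for p in out_paths]
        # Whole lines are copied, so no JSON record is ever cut in half.
        for n, line in enumerate(src):
            outs[n % parts].write(line)
    return out_paths
```

Splitting on line boundaries keeps each part independently parseable, so the parts can be processed in parallel.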
Orientation information will also be extracted, to be consistent with the dbSNP build 151 file.
'primary_snapshot_data': {'allele_annotations': [{'assembly_annotation': [{'annotation_release': 'Homo sapiens Annotation Release 109',
    'genes': [{'id': 729296,
               'is_pseudo': False,
               'locus': 'LOC729296',
               'name': 'uncharacterized LOC729296',
               'orientation': 'plus',
Some SNPs do not have orientation information. For these cases we will output '-'.
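A hedged sketch of that fallback, assuming the orientation sits under `primary_snapshot_data` → `allele_annotations` → `assembly_annotation` → `genes` → `orientation`, as in the excerpt above. The function name `gene_orientation` is hypothetical, and returning the first orientation found is an assumption, not necessarily what the real parser does.

```python
def gene_orientation(record, missing="-"):
    """Return the first gene orientation found in a record, or `missing`."""
    snapshot = record.get("primary_snapshot_data", {})
    for annotation in snapshot.get("allele_annotations", []):
        for assembly in annotation.get("assembly_annotation", []):
            for gene in assembly.get("genes", []):
                orientation = gene.get("orientation")
                if orientation:
                    return orientation
    # No gene annotation at all, e.g. an empty assembly_annotation list.
    return missing
```

A record annotated with several genes on different strands would need an explicit policy; this sketch does not address that case.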
{"refsnp_id":"171425","create_date":"2000-07-12T13:47Z","last_update_date":"2019-07-14T00:08Z","last_update_build_id":"153","dbsnp1_merges":[{"merged_rsid":"171426","revision":"137","merge_date":"2012-05-4T12:46Z"},{"merged_rsid":"57051557","revision":"130","merge_date":"2008-05-23T16:19Z"}],"citations":[],"lost_obs_movements":[],"present_obs_movements":[{"component_ids":[{"type":"subsnp","value":"81368782"}],"observation":{"seq_id":"NC_000017.9","position":41374943,"deleted_sequence":"G","inserted_sequence":"G"},"allele_in_cur_release":{"seq_id":"NC_000017.9","position":41374943,"deleted_sequence":"G","inserted_sequence":"G"},"other_rsids_in_cur_release":[],"previous_release":{"allele":{"seq_id":"NC_000017.9","position":41374943,"deleted_sequence":"G","inserted_sequence":"G"},"rsids":["171425"]},"last_added_to_this_rs":"137"},{"component_ids":[{"type":"subsnp","value":"81368782"}],"observation":{"seq_id":"NC_000017.9","position":41374943,"deleted_sequence":"G","inserted_sequence":"T"},"allele_in_cur_release":{"seq_id":"NC_000017.9","position":41374943,"deleted_sequence":"G","inserted_sequence":"T"},"other_rsids_in_cur_release":[],"previous_release":{"allele":{"seq_id":"NC_000017.9","position":41374943,"deleted_sequence":"G","inserted_sequence":"T"},"rsids":["171425"]},"last_added_to_this_rs":"137"}],"primary_snapshot_data":{"placements_with_allele":[{"seq_id":"NC_000017.9","is_ptlp":true,"placement_annot":{"seq_type":"refseq_chromosome","mol_type":"genomic","seq_id_traits_by_assembly":[],"is_aln_opposite_orientation":false,"is_mismatch":false},"alleles":[{"allele":{"spdi":{"seq_id":"NC_000017.9","position":41374943,"deleted_sequence":"G","inserted_sequence":"G"}},"hgvs":"NC_000017.9:g.41374944="},{"allele":{"spdi":{"seq_id":"NC_000017.9","position":41374943,"deleted_sequence":"G","inserted_sequence":"T"}},"hgvs":"NC_000017.9:g.41374944G>T"}]}],"allele_annotations":[{"frequency":[],"clinical":[],"submissions":["81368782"],"assembly_annotation":[]},{"frequency":[],"clinical":[],"submissions":["81368782"],"assembly_annotation":[]}],"support":[{"id":{"type":"subsnp","value":"ss81368782"},"revision_added":"137","create_date":"2007-12-14T18:29Z","submitter_handle":"HGSV"}],"anchor":"NC_000017.9:0041374943:1:snv","variant_type":"snv"}}