USF-HII / snptk

USF HII SNP Toolkit - Analyze and translate SNP entries using NCBI dbSNP and related databases
GNU General Public License v3.0

Parse Rsmerge json file to extract information on merged snps for GRCh37 and GRCh38 #8

Closed: j2moreno closed this issue 4 years ago

j2moreno commented 4 years ago

https://ftp.ncbi.nih.gov/snp/latest_release/JSON/refsnp-merged.json.bz2

j2moreno commented 4 years ago

Fields extracted from the JSON to get merged SNPs:

j2moreno commented 4 years ago

Some SNP IDs have an empty merged_into list, so they have no merge target and are not included when creating the Rsmerge flat file:


```
2020-03-16 15:41:48 svc-3024-5-8.rc.usf.edu DEBUG(1): rs748938867 in file 12 has no merge info!
{'citations': [],
 'create_date': '2015-04-1T22:25Z',
 'dbsnp1_merges': [],
 'last_update_build_id': '152',
 'last_update_date': '2018-10-12T18:51Z',
 'lost_obs_movements': [{'allele_in_cur_release': {'deleted_sequence': 'AAAT',
                                                   'inserted_sequence': 'AAAT',
                                                   'position': 21541580,
                                                   'seq_id': 'NC_000014.9'},
                         'allele_in_prev_release': {'deleted_sequence': 'AAATAAATAAATAAATAAATAAATAAATAAATAAATAAATAAATAAATAAA',
                                                    'inserted_sequence': 'AAATAAATAAATAAATAAATAAATAAATAAATAAATAAATAAATAAATAAA',
                                                    'position': 22009718,
                                                    'seq_id': 'NC_000014.8'},
                         'component_ids': [{'type': 'subsnp',
                                            'value': '1710625764'},
                                           {'type': 'subsnp',
                                            'value': '1710625767'}],
                         'observation': {'deleted_sequence': 'AAAT',
                                         'inserted_sequence': 'AAAT',
                                         'position': 22009726,
                                         'seq_id': 'NC_000014.8'},
                         'rsids_in_cur_release': ['71419142']},
                        {'allele_in_cur_release': {'deleted_sequence': 'AAATAAATAAATAAATAAATAAATAAATAAATAAATAAATAAATAAA',
                                                   'inserted_sequence': 'AAATAAATAAATAAATAAATAAATAAATAAATAAATAAATAAATAAA',
                                                   'position': 21541576,
                                                   'seq_id': 'NC_000014.9'},
                         'allele_in_prev_release': {'deleted_sequence': 'AAATAAATAAATAAATAAATAAATAAATAAATAAATAAATAAATAAATAAA',
                                                    'inserted_sequence': 'AAATAAATAAATAAATAAATAAATAAATAAATAAATAAATAAATAAA',
                                                    'position': 22009718,
                                                    'seq_id': 'NC_000014.8'},
                         'component_ids': [{'type': 'subsnp',
                                            'value': '1710625764'},
                                           {'type': 'subsnp',
                                            'value': '1710625767'}],
                         'observation': {'deleted_sequence': 'AAAT',
                                         'inserted_sequence': '',
                                         'position': 22009726,
                                         'seq_id': 'NC_000014.8'},
                         'rsids_in_cur_release': ['71419142']}],
 'merged_snapshot_data': {'merged_into': [],
                          'proxy_build_id': '152',
                          'proxy_time': '2018-10-12T18:51Z'},
 'present_obs_movements': [],
 'refsnp_id': '748938867'}
```
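
For illustration, here is a minimal sketch of how records like the one above could be separated from records that do carry a merge target. It assumes each line of the merged file is one JSON object with the `refsnp_id` and `merged_snapshot_data.merged_into` keys seen in the dump. The function names `parse_merged_line` and `split_by_merge_info` are hypothetical and are not taken from snptk-parse-rsmerge-json.py.

```python
import json


def parse_merged_line(line):
    """Return (refsnp_id, merged_into list) for one JSON-encoded record."""
    record = json.loads(line)
    # A missing or empty snapshot is treated the same as an empty merged_into list.
    snapshot = record.get("merged_snapshot_data") or {}
    targets = snapshot.get("merged_into") or []
    return record["refsnp_id"], list(targets)


def split_by_merge_info(lines):
    """Collect (old_id, new_id) rows, plus the ids that have no merge target."""
    rows, unmerged = [], []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        rsid, targets = parse_merged_line(line)
        if targets:
            rows.extend((rsid, target) for target in targets)
        else:
            unmerged.append(rsid)  # e.g. rs748938867 in the log above
    return rows, unmerged
```

Returning the unmerged ids separately, rather than dropping them silently, makes it possible to log them the way the DEBUG line above does.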
j2moreno commented 4 years ago

https://github.com/USF-HII/snptk/commit/51d4f35c8b0fb0e51cac6309dddf90546ab945e8

j2moreno commented 4 years ago

Because of the size of the Rsmerge file (https://ftp.ncbi.nih.gov/snp/latest_release/JSON/refsnp-merged.json.bz2), it was split into 32 files with snptk-split before being processed with snptk-parse-rsmerge-json.py.
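
As a rough sketch of the splitting idea (not the actual snptk-split implementation, whose behavior is not shown in this thread), a bz2-compressed file with one record per line could be distributed round-robin across a fixed number of output files. The function name `split_lines_round_robin`, the output naming scheme, and the uncompressed output are all assumptions made for the example.

```python
import bz2
from contextlib import ExitStack


def split_lines_round_robin(src_path, out_prefix, parts=32):
    """Distribute the lines of a bz2 text file across `parts` plain-text files."""
    out_paths = [f"{out_prefix}.{i:02d}" for i in range(parts)]
    with ExitStack() as stack:
        # Stream the compressed input so the whole file never sits in memory.
        src = stack.enter_context(bz2.open(src_path, "rt", encoding="utf-8"))
        outs = [
            stack.enter_context(open(path, "w", encoding="utf-8"))
            for path in out_paths
        ]
        for n, line in enumerate(src):
            outs[n % parts].write(line)  # line n goes to file n mod parts
    return out_paths
```

Because every record stays whole on its own line, each output file can be parsed independently, which is what allows the parts to be processed in parallel.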

j2moreno commented 4 years ago

https://github.com/USF-HII/snptk/commit/1c4fc6f2a66cffafe2f9c31bca6c6a88ec58a0b3

j2moreno commented 4 years ago

https://github.com/USF-HII/snptk/commit/ddc7956ee35b7ff805b7f439d489fae273536222