Open RosaDeSa opened 1 year ago
Hi @RosaDeSa 👋, were you able to figure out what the issue was? If so, it could be helpful for others if you shared your solution. I'm unsure how ezancestry handles VEP annotations, though the parser from snps might be robust enough to handle them.
Hi @arvkevi, I obtained the prediction.csv file and plotted it. The problem was probably due to a malformed file; I regenerated the VCF file, adding some parameters in VEP. Even so, I'm still evaluating the results: I used two different VCFs (from two different samples), but the prediction results are exactly the same, which is a little weird. I'll try snps, as you suggested. If I get consistent results, I'll gladly share the solution here! Thanks
Ezancestry uses snps to read vcfs in process.py. Are the two samples related? Do they have the exact same set of AISNPs?
I noticed that; using snps directly, I get the same results too. The samples are not related; they belong to two different people. And yes, they have the same AISNPs. It's weird, isn't it?
In a while I'll analyze WGS data from two other samples, and I'll test the script on those as well.
#pca,kidd,/home/r.desantis/.ezancestry/data/models,/home/r.desantis/.ezancestry/data/aisnps
,component1,component2,component3,predicted_population_population,ACB,ASW,BEB,CDX,CEU,CHB,CHS,CLM,ESN,FIN,GBR,GIH,GWD,IBS,ITU,JPT,KHV,LWK,MSL,MXL,PEL,PJL,PUR,STU,TSI,YRI,predicted_population_superpopulation,AFR,AMR,EAS,EUR,SAS,population_description,superpopulation_name
LV_vep.vcf,0.11874386857468588,0.15300045809781831,0.3265148978535419,ITU,0.0,0.0,0.08919748915377203,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.09703463769218275,0.0,0.0,0.29927578644262454,0.0,0.0,0.0,0.0,0.0,0.0,0.08710151819096609,0.22274821473011025,0.20464235379034443,0.0,0.0,SAS,0.0,0.17202243612400409,0.0,0.0,0.827977563875996,Indian Telugu in the UK,South Asian Ancestry
#pca,kidd,/home/r.desantis/.ezancestry/data/models,/home/r.desantis/.ezancestry/data/aisnps
,component1,component2,component3,predicted_population_population,ACB,ASW,BEB,CDX,CEU,CHB,CHS,CLM,ESN,FIN,GBR,GIH,GWD,IBS,ITU,JPT,KHV,LWK,MSL,MXL,PEL,PJL,PUR,STU,TSI,YRI,predicted_population_superpopulation,AFR,AMR,EAS,EUR,SAS,population_description,superpopulation_name
out.vcf,0.11874386857468588,0.15300045809781831,0.3265148978535419,ITU,0.0,0.0,0.08919748915377203,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.09703463769218275,0.0,0.0,0.29927578644262454,0.0,0.0,0.0,0.0,0.0,0.0,0.08710151819096609,0.22274821473011025,0.20464235379034443,0.0,0.0,SAS,0.0,0.17202243612400409,0.0,0.0,0.827977563875996,Indian Telugu in the UK,South Asian Ancestry
Hi @arvkevi, I have the same problem with the other two samples as well.
Below is the head of the VCF with the SNPs that I give as input. Is it correct for ezancestry?
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT a2
chr1 13813 . T G 67.64 MQ_filter AC=1;AF=0.500;AN=2;BaseQRankSum=-1.645;DP=5;ExcessHet=0.0000;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=24.33;MQRankSum=-1.282;QD=13.53;ReadPosRankSum=1.036;SOR=1.609 GT:AD:DP:FT:GQ:PL 0/1:3,2:5:DP_filter:75:75,0,120
chr1 13838 rs200683566 C T 64.64 MQ_filter AC=1;AF=0.500;AN=2;BaseQRankSum=0.000;DB;DP=6;ExcessHet=0.0000;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=25.17;MQRankSum=-1.501;QD=10.77;ReadPosRankSum=0.431;SOR=1.179 GT:AD:DP:FT:GQ:PL 0/1:4,2:6:DP_filter:72:72,0,142
chr1 13868 . A G 32.65 MQ_filter AC=1;AF=0.500;AN=2;BaseQRankSum=-0.967;DP=3;ExcessHet=0.0000;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=26.87;MQRankSum=0.967;QD=10.88;ReadPosRankSum=0.967;SOR=0.223 GT:AD:DP:FT:GQ:PL 0/1:1,2:3:DP_filter:18:40,0,18
chr1 16288 rs200736374 C G 42.64 QD_filter AC=1;AF=0.500;AN=2;BaseQRankSum=1.889;DB;DP=36;ExcessHet=0.0000;FS=1.817;MLEAC=1;MLEAF=0.500;MQ=42.58;MQRankSum=-2.014;QD=1.22;ReadPosRankSum=1.022;SOR=0.939 GT:AD:DP:GQ:PL 0/1:30,5:35:50:50,0,968
chr1 16298 rs200451305 C T 311.64 PASS AC=1;AF=0.500;AN=2;BaseQRankSum=1.497;DB;DP=30;ExcessHet=0.0000;FS=3.682;MLEAC=1;MLEAF=0.500;MQ=42.47;MQRankSum=-4.337;QD=12.47;ReadPosRankSum=2.029;SOR=1.388 GT:AD:DP:GQ:PL 0/1:13,12:25:99:319,0,385
chr1 16378 rs148220436 T C 293.64 MQ_filter AC=1;AF=0.500;AN=2;BaseQRankSum=-2.461;DB;DP=38;ExcessHet=0.0000;FS=5.153;MLEAC=1;MLEAF=0.500;MQ=36.39;MQRankSum=-3.036;QD=8.16;ReadPosRankSum=-0.747;SOR=1.190 GT:AD:DP:GQ:PL 0/1:22,14:36:99:301,0,599
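A quick way to sanity-check a VCF like the one above is to count how many records carry an rsID in the ID column (rather than `.`) and how many use the `chr` prefix. The following is a minimal sketch using only the standard library; `summarize_vcf_ids` is a hypothetical helper written for illustration, not part of ezancestry or snps, and the example lines are shortened versions of the records shown above.

```python
from typing import Dict, Iterable


def summarize_vcf_ids(lines: Iterable[str]) -> Dict[str, int]:
    """Count data records, records with an rsID, and records whose CHROM starts with 'chr'."""
    total = with_rsid = chr_prefixed = 0
    for line in lines:
        # Skip blank lines and header lines (## meta lines and the #CHROM line).
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) < 3:
            continue
        chrom, _pos, variant_id = fields[:3]
        total += 1
        if variant_id.startswith("rs"):
            with_rsid += 1
        if chrom.startswith("chr"):
            chr_prefixed += 1
    return {"records": total, "with_rsid": with_rsid, "chr_prefixed": chr_prefixed}


# Shortened records based on the VCF head in this thread (INFO/FORMAT trimmed).
example = [
    "#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT a2",
    "chr1 13813 . T G 67.64 MQ_filter . GT 0/1",
    "chr1 13838 rs200683566 C T 64.64 MQ_filter . GT 0/1",
]
print(summarize_vcf_ids(example))
```

For a real file, the same function can be fed an open file handle, e.g. `summarize_vcf_ids(open("sample.vcf"))`, where `sample.vcf` stands for your own uncompressed VCF.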
Hey @RosaDeSa, one other thing that could be contributing to this is having too many missing AISNPs in the vcf. When you call predict, it should log a message indicating how many AISNPs were present in your vcf for a sample. It looks like this (from cell 23 of this notebook).
2021-09-20 06:25:34.289 | INFO | ezancestry.process:_input_to_dataframe:276 - Sample has a valid genotype for 44 out of a possible 55 (80.0%)
Do you know how many AISNPs were in your input samples?
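If the log message is not at hand, a rough count can be made independently. The sketch below is an assumption-laden stand-in, not how ezancestry computes its figure: `aisnp_coverage` is a hypothetical helper that matches on rsID only, reads the genotype from the first sample column, and treats any genotype containing `.` as missing. The rsIDs in the example panel are placeholders, not real AISNPs.

```python
from typing import Iterable, Set, Tuple


def aisnp_coverage(vcf_lines: Iterable[str], aisnp_rsids: Set[str]) -> Tuple[int, int, float]:
    """Return (found, total, percent) of panel rsIDs with a called genotype in the VCF."""
    found = set()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) < 10:
            continue  # no sample column, so no genotype to check
        rsid = fields[2]
        genotype = fields[9].split(":")[0]  # GT is the first FORMAT subfield
        if rsid in aisnp_rsids and "." not in genotype:
            found.add(rsid)
    total = len(aisnp_rsids)
    percent = 100.0 * len(found) / total if total else 0.0
    return len(found), total, percent


# Placeholder panel and records for illustration only.
panel = {"rs200683566", "rs_placeholder"}
records = [
    "#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT a2",
    "chr1 13838 rs200683566 C T 64.64 MQ_filter . GT:DP 0/1:6",
    "chr1 16298 rs200451305 C T 311.64 PASS . GT:DP 0/1:25",
]
print(aisnp_coverage(records, panel))
```

A count near zero, as opposed to something like 44 of 55, suggests the input and the AISNP panel are not lining up at all, which is a different problem from a few missing calls.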
Yes, you're right! I have 0 out of a possible 55 using the Kidd set and 1 of 127 using the Seldin set. Do you think the problem is the reference I used to align the data (hg38)? The prediction looks up the AISNPs by rsID and not by position, right?
Hmm, the merge is on both rsid AND position. Unfortunately, this requires a VCF annotated with rsIDs whose positions match the hg19 positions from the .aisnps files.
You could try commenting out "chr" and "position_hg19" in this line, but I haven't looked at the hg19->hg38 liftover in about a year. So if you do this, you should see if any alleles changed.
I'll have to think about how ezancestry could support hg38. The easiest would probably be a --hg38 flag that uses new versions of the aisnps files. But I won't have time to get to this work for a little while.
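To see why hg38 input yields almost no matches, here is a small plain-Python sketch of joining on rsID plus position versus rsID alone. It is not ezancestry's merge code: `count_matches` is a hypothetical helper, the column names `rsid`, `chr`, and `position_hg19` are taken from the comment above, and all rsIDs and coordinates are made up.

```python
from typing import Dict, List, Sequence


def count_matches(aisnps: List[Dict], sample: List[Dict], keys: Sequence[str]) -> int:
    """Count AISNP rows for which some sample row agrees on every column in `keys`."""
    sample_keys = {tuple(row[k] for k in keys) for row in sample}
    return sum(tuple(row[k] for k in keys) in sample_keys for row in aisnps)


# Made-up panel with hg19 coordinates.
aisnps = [
    {"rsid": "rsA", "chr": "1", "position_hg19": 1000},
    {"rsid": "rsB", "chr": "2", "position_hg19": 2000},
]

# Same variants called against hg38: the rsIDs agree, the coordinates do not.
# The column is still named position_hg19 because that is what the join compares.
sample_hg38 = [
    {"rsid": "rsA", "chr": "1", "position_hg19": 1200},
    {"rsid": "rsB", "chr": "2", "position_hg19": 2300},
]

strict = count_matches(aisnps, sample_hg38, ("rsid", "chr", "position_hg19"))
rsid_only = count_matches(aisnps, sample_hg38, ("rsid",))
print(strict, rsid_only)
```

Matching on rsID alone recovers the variants, but, as noted above, it skips the position check, so alleles should be verified separately before trusting the prediction.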
Hi Kevin, I'm trying this script, but I'm running into this error during the prediction (the VCF file was annotated with VEP):
DEBUG | ezancestry.process:process_user_input:214 - list index out of range
Traceback (most recent call last):
File "/home/tigem/r.desantis/.local/lib/python3.9/site-packages/ezancestry/process.py", line 217, in process_user_input
snpsdf = pd.read_csv(
File "/usr/local/lib/python3.9/dist-packages/pandas/util/_decorators.py", line 311, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python3.9/dist-packages/pandas/io/parsers/readers.py", line 678, in read_csv
return _read(filepath_or_buffer, kwds)
File "/usr/local/lib/python3.9/dist-packages/pandas/io/parsers/readers.py", line 581, in _read
return parser.read(nrows)
File "/usr/local/lib/python3.9/dist-packages/pandas/io/parsers/readers.py", line 1253, in read
index, columns, col_dict = self._engine.read(nrows)
File "/usr/local/lib/python3.9/dist-packages/pandas/io/parsers/python_parser.py", line 270, in read
alldata = self._rows_to_cols(content)
File "/usr/local/lib/python3.9/dist-packages/pandas/io/parsers/python_parser.py", line 1013, in _rows_to_cols
self._alert_malformed(msg, row_num + 1)
File "/usr/local/lib/python3.9/dist-packages/pandas/io/parsers/python_parser.py", line 739, in _alert_malformed
raise ParserError(msg)
pandas.errors.ParserError: Expected 3 fields in line 7, saw 4
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/tigem/r.desantis/.local/bin/ezancestry", line 8, in <module>
sys.exit(app())
File "/home/tigem/r.desantis/.local/lib/python3.9/site-packages/typer/main.py", line 214, in __call__
return get_command(self)(*args, **kwargs)
File "/home/tigem/r.desantis/.local/lib/python3.9/site-packages/click/core.py", line 829, in __call__
return self.main(*args, **kwargs)
File "/home/tigem/r.desantis/.local/lib/python3.9/site-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/home/tigem/r.desantis/.local/lib/python3.9/site-packages/click/core.py", line 1259, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/tigem/r.desantis/.local/lib/python3.9/site-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/tigem/r.desantis/.local/lib/python3.9/site-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "/home/tigem/r.desantis/.local/lib/python3.9/site-packages/typer/main.py", line 532, in wrapper
return callback(**use_params) # type: ignore
File "/home/tigem/r.desantis/.local/lib/python3.9/site-packages/ezancestry/commands.py", line 286, in predict
snpsdf = process_user_input(input_data, aisnps_directory, aisnps_set)
File "/home/tigem/r.desantis/.local/lib/python3.9/site-packages/ezancestry/process.py", line 232, in process_user_input
raise ValueError(
ValueError: a1.VEP.ann.vcf is not a valid file or directory. Please provide a valid file or directory.