arvkevi / ezancestry

Easy genetic ancestry predictions in Python
https://ezancestry.streamlit.app
MIT License
56 stars 11 forks

ValueError: vcf is not a valid file or directory. Please provide a valid file or directory. #71

Open RosaDeSa opened 1 year ago

RosaDeSa commented 1 year ago

Hi Kevin, I'm trying this script but I'm running into this error during the prediction (the VCF file was annotated with VEP):

DEBUG | ezancestry.process:process_user_input:214 - list index out of range
Traceback (most recent call last):
  File "/home/tigem/r.desantis/.local/lib/python3.9/site-packages/ezancestry/process.py", line 217, in process_user_input
    snpsdf = pd.read_csv(
  File "/usr/local/lib/python3.9/dist-packages/pandas/util/_decorators.py", line 311, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/pandas/io/parsers/readers.py", line 678, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/usr/local/lib/python3.9/dist-packages/pandas/io/parsers/readers.py", line 581, in _read
    return parser.read(nrows)
  File "/usr/local/lib/python3.9/dist-packages/pandas/io/parsers/readers.py", line 1253, in read
    index, columns, col_dict = self._engine.read(nrows)
  File "/usr/local/lib/python3.9/dist-packages/pandas/io/parsers/python_parser.py", line 270, in read
    alldata = self._rows_to_cols(content)
  File "/usr/local/lib/python3.9/dist-packages/pandas/io/parsers/python_parser.py", line 1013, in _rows_to_cols
    self._alert_malformed(msg, row_num + 1)
  File "/usr/local/lib/python3.9/dist-packages/pandas/io/parsers/python_parser.py", line 739, in _alert_malformed
    raise ParserError(msg)
pandas.errors.ParserError: Expected 3 fields in line 7, saw 4

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/tigem/r.desantis/.local/bin/ezancestry", line 8, in <module>
    sys.exit(app())
  File "/home/tigem/r.desantis/.local/lib/python3.9/site-packages/typer/main.py", line 214, in __call__
    return get_command(self)(*args, **kwargs)
  File "/home/tigem/r.desantis/.local/lib/python3.9/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/tigem/r.desantis/.local/lib/python3.9/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/tigem/r.desantis/.local/lib/python3.9/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/tigem/r.desantis/.local/lib/python3.9/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/tigem/r.desantis/.local/lib/python3.9/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/tigem/r.desantis/.local/lib/python3.9/site-packages/typer/main.py", line 532, in wrapper
    return callback(**use_params)  # type: ignore
  File "/home/tigem/r.desantis/.local/lib/python3.9/site-packages/ezancestry/commands.py", line 286, in predict
    snpsdf = process_user_input(input_data, aisnps_directory, aisnps_set)
  File "/home/tigem/r.desantis/.local/lib/python3.9/site-packages/ezancestry/process.py", line 232, in process_user_input
    raise ValueError(
ValueError: a1.VEP.ann.vcf is not a valid file or directory. Please provide a valid file or directory.

arvkevi commented 1 year ago

Hi @RosaDeSa 👋🏼 were you able to figure out what the issue was? If so, it could be helpful for others if you share your solution. I'm unsure how ezancestry handles VEP annotations, though the parser from snps might be robust enough to handle them.

RosaDeSa commented 1 year ago

Hi @arvkevi, I obtained the prediction.csv file and plotted it. The problem was probably due to a malformed file; I regenerated the VCF file after adding some parameters in VEP. Even so, I'm still evaluating the results: I used two different VCFs (from two different samples), but the prediction results are exactly the same, which seems a little weird. I'll try snps, as you suggested. If I find consistent results, I'll gladly share the solution here! Thanks

arvkevi commented 1 year ago

Ezancestry uses snps to read vcfs in process.py. Are the two samples related? Do they have the exact same set of AISNPs?

RosaDeSa commented 1 year ago

I noticed that; using snps directly I get the same results too. The samples are not related, they belong to two different people. And yes, they have the same AISNPs. It's weird, isn't it?

In a while I'll analyze WGS data from 2 other samples, and I'll test the script on those as well.

#pca,kidd,/home/r.desantis/.ezancestry/data/models,/home/r.desantis/.ezancestry/data/aisnps
,component1,component2,component3,predicted_population_population,ACB,ASW,BEB,CDX,CEU,CHB,CHS,CLM,ESN,FIN,GBR,GIH,GWD,IBS,ITU,JPT,KHV,LWK,MSL,MXL,PEL,PJL,PUR,STU,TSI,YRI,predicted_population_superpopulation,AFR,AMR,EAS,EUR,SAS,population_description,superpopulation_name
LV_vep.vcf,0.11874386857468588,0.15300045809781831,0.3265148978535419,ITU,0.0,0.0,0.08919748915377203,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.09703463769218275,0.0,0.0,0.29927578644262454,0.0,0.0,0.0,0.0,0.0,0.0,0.08710151819096609,0.22274821473011025,0.20464235379034443,0.0,0.0,SAS,0.0,0.17202243612400409,0.0,0.0,0.827977563875996,Indian Telugu in the UK,South Asian Ancestry

#pca,kidd,/home/r.desantis/.ezancestry/data/models,/home/r.desantis/.ezancestry/data/aisnps
,component1,component2,component3,predicted_population_population,ACB,ASW,BEB,CDX,CEU,CHB,CHS,CLM,ESN,FIN,GBR,GIH,GWD,IBS,ITU,JPT,KHV,LWK,MSL,MXL,PEL,PJL,PUR,STU,TSI,YRI,predicted_population_superpopulation,AFR,AMR,EAS,EUR,SAS,population_description,superpopulation_name
out.vcf,0.11874386857468588,0.15300045809781831,0.3265148978535419,ITU,0.0,0.0,0.08919748915377203,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.09703463769218275,0.0,0.0,0.29927578644262454,0.0,0.0,0.0,0.0,0.0,0.0,0.08710151819096609,0.22274821473011025,0.20464235379034443,0.0,0.0,SAS,0.0,0.17202243612400409,0.0,0.0,0.827977563875996,Indian Telugu in the UK,South Asian Ancestry
RosaDeSa commented 1 year ago

Hi @arvkevi, I have the same problem with the other 2 samples as well.

Below is the head of the VCF with SNPs that I give as input. Is this format correct for ezancestry?

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  a2
chr1    13813   .       T       G       67.64   MQ_filter       AC=1;AF=0.500;AN=2;BaseQRankSum=-1.645;DP=5;ExcessHet=0.0000;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=24.33;MQRankSum=-1.282;QD=13.53;ReadPosRankSum=1.036;SOR=1.609     GT:AD:DP:FT:GQ:PL       0/1:3,2:5:DP_filter:75:75,0,120
chr1    13838   rs200683566     C       T       64.64   MQ_filter       AC=1;AF=0.500;AN=2;BaseQRankSum=0.000;DB;DP=6;ExcessHet=0.0000;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=25.17;MQRankSum=-1.501;QD=10.77;ReadPosRankSum=0.431;SOR=1.179   GT:AD:DP:FT:GQ:PL       0/1:4,2:6:DP_filter:72:72,0,142
chr1    13868   .       A       G       32.65   MQ_filter       AC=1;AF=0.500;AN=2;BaseQRankSum=-0.967;DP=3;ExcessHet=0.0000;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=26.87;MQRankSum=0.967;QD=10.88;ReadPosRankSum=0.967;SOR=0.223      GT:AD:DP:FT:GQ:PL       0/1:1,2:3:DP_filter:18:40,0,18
chr1    16288   rs200736374     C       G       42.64   QD_filter       AC=1;AF=0.500;AN=2;BaseQRankSum=1.889;DB;DP=36;ExcessHet=0.0000;FS=1.817;MLEAC=1;MLEAF=0.500;MQ=42.58;MQRankSum=-2.014;QD=1.22;ReadPosRankSum=1.022;SOR=0.939   GT:AD:DP:GQ:PL  0/1:30,5:35:50:50,0,968
chr1    16298   rs200451305     C       T       311.64  PASS    AC=1;AF=0.500;AN=2;BaseQRankSum=1.497;DB;DP=30;ExcessHet=0.0000;FS=3.682;MLEAC=1;MLEAF=0.500;MQ=42.47;MQRankSum=-4.337;QD=12.47;ReadPosRankSum=2.029;SOR=1.388  GT:AD:DP:GQ:PL  0/1:13,12:25:99:319,0,385
chr1    16378   rs148220436     T       C       293.64  MQ_filter       AC=1;AF=0.500;AN=2;BaseQRankSum=-2.461;DB;DP=38;ExcessHet=0.0000;FS=5.153;MLEAC=1;MLEAF=0.500;MQ=36.39;MQRankSum=-3.036;QD=8.16;ReadPosRankSum=-0.747;SOR=1.190 GT:AD:DP:GQ:PL  0/1:22,14:36:99:301,0,599
arvkevi commented 1 year ago

Hey @RosaDeSa, one other thing that could be contributing to this is having too many missing AISNPs in the vcf. When you call predict, it should log a message indicating how many AISNPs were present in your vcf for a sample. It looks like this (from cell 23 of this notebook).

2021-09-20 06:25:34.289 | INFO     | ezancestry.process:_input_to_dataframe:276 - Sample has a valid genotype for 44 out of a possible 55 (80.0%)

Do you know how many AISNPs were in your input samples?
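
For a rough check that does not depend on ezancestry, you can count how many AISNP rsids show up in the ID column of the VCF. The sketch below uses only the standard library. The function name `count_aisnps_in_vcf` and the example panel are made up for illustration, and it matches on rsid alone, so it can report more hits than ezancestry's own log line would.

```python
def count_aisnps_in_vcf(vcf_lines, aisnp_rsids):
    """Count how many of the given rsids appear in the ID column of a VCF.

    vcf_lines: iterable of text lines from an uncompressed VCF.
    aisnp_rsids: set of rsids to look for (hypothetical input; load these
    from whichever AISNP panel you are using).
    """
    found = set()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # not a complete VCF record
        # The ID column may hold several ids separated by ';', or '.' if missing.
        for rsid in fields[2].split(";"):
            if rsid in aisnp_rsids:
                found.add(rsid)
    return len(found), len(aisnp_rsids)


example_vcf = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\ta2\n",
    "chr1\t13813\t.\tT\tG\t67.64\tMQ_filter\t.\tGT\t0/1\n",
    "chr1\t13838\trs200683566\tC\tT\t64.64\tMQ_filter\t.\tGT\t0/1\n",
]
# Made-up panel for the example: one rsid is in the VCF, one is not.
panel = {"rs200683566", "rs0000001"}
present, total = count_aisnps_in_vcf(example_vcf, panel)
print(f"{present} of {total} panel rsids found in the VCF")
```

For a real file, pass an open file handle as `vcf_lines`. A count near zero suggests the VCF either lacks rsid annotations or does not cover the panel sites.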

RosaDeSa commented 1 year ago

Yes, you're right! I have 0 out of a possible 55 using the Kidd set and 1 of 127 using the Seldin set. Do you think the problem is the reference I used to align the data (hg38)? Prediction looks up the AISNPs by rsid and not by position, right?

arvkevi commented 1 year ago

Hmm, the merge is on both rsid AND position. Unfortunately, this requires the vcf to be annotated with rsids and the positions to match the hg19 positions from the .aisnps files.
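
To illustrate why a different build can drop every site, here is a minimal stand-in for a join on rsid, chromosome, and position. It is not ezancestry's code: the rsids and coordinates are invented, and the helper `strict_matches` is hypothetical.

```python
# Hypothetical AISNP reference rows: (rsid, chromosome, hg19 position).
# The rsids and coordinates are invented for illustration only.
aisnps = [
    ("rs111", "1", 1000),
    ("rs222", "2", 2000),
]

# The same variants as they might appear in a VCF aligned to another build:
# same rsids, shifted positions.
sample_other_build = [
    ("rs111", "1", 1050),
    ("rs222", "2", 2075),
]

# A VCF whose positions agree with the reference rows.
sample_same_build = [
    ("rs111", "1", 1000),
    ("rs222", "2", 2000),
]


def strict_matches(reference, sample):
    """Keep only sample rows whose (rsid, chromosome, position) all match."""
    reference_keys = set(reference)
    return [row for row in sample if row in reference_keys]


print(len(strict_matches(aisnps, sample_same_build)))   # 2
print(len(strict_matches(aisnps, sample_other_build)))  # 0
```

With every rsid present but every position shifted, the strict join keeps nothing, which is consistent with seeing 0 of 55 on an hg38-aligned sample.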

You could try commenting out "chr" and "position_hg19" in this line, but I haven't looked at the hg19->hg38 liftover in about a year. So if you do this, you should see if any alleles changed.
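
If you do relax the join to rsid only, a quick allele sanity check could look like the sketch below. This is an assumption-laden illustration, not ezancestry's logic: the function `match_on_rsid`, the reference structure (rsid mapped to a set of expected alleles), and the rsids are all hypothetical.

```python
def match_on_rsid(reference, sample):
    """Join on rsid only and flag rows whose alleles look unexpected.

    reference: dict mapping rsid -> set of expected alleles (hypothetical structure).
    sample: list of (rsid, ref, alt) tuples taken from the VCF.
    Returns (matched, suspicious) lists of rsids.
    """
    matched, suspicious = [], []
    for rsid, ref, alt in sample:
        expected = reference.get(rsid)
        if expected is None:
            continue  # rsid is not part of the panel
        if {ref, alt} <= expected:
            matched.append(rsid)
        else:
            # Alleles differ from the panel, e.g. a strand flip; inspect by hand.
            suspicious.append(rsid)
    return matched, suspicious


# Invented rsids and alleles, for illustration only.
reference = {"rs111": {"A", "G"}, "rs222": {"C", "T"}}
sample = [
    ("rs111", "A", "G"),  # alleles agree with the panel
    ("rs222", "G", "A"),  # complement of C/T: flagged for review
    ("rs333", "C", "T"),  # not in the panel: ignored
]
matched, suspicious = match_on_rsid(reference, sample)
print("matched:", matched)
print("check by hand:", suspicious)
```

Anything that lands in the second list is worth checking before trusting a prediction built on an rsid-only join.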

I'll have to think about how ezancestry could support hg38. The easiest would probably be a --hg38 flag that uses new versions of the aisnps files. But I won't have time to get to this work for a little while.