Closed markozecevic closed 2 years ago
The specific error happens when standardizing the kinship matrix, so the problem seems to lie there. I can't tell whether it is caused by the values in the matrix or by the way it is loaded in the first place. Regarding the input format, I fully agree: it is not documented well enough, and we rely on some particularities that might not even be standard. GWAS_Flow will be superseded by a new program soon, and we'll provide better compliance with standards and better documentation there. I understand that this is not a satisfactory solution for your current problem, so if you can't wait for the replacement I can try to help you get GWAS_Flow to work with your data. In that case: can you provide example files that look like your data and produce the same error message?
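To separate "bad values" from "bad loading", a few generic checks on the kinship matrix can be run before handing it to the tool. The following is only a minimal sketch: it assumes the kinship CSV has sample IDs in both the header row and the first column, which may not match what GWAS_Flow actually expects, and `check_kinship` is a hypothetical helper, not part of GWAS_Flow.

```python
import io

import numpy as np
import pandas as pd


def check_kinship(K: pd.DataFrame) -> list:
    """Return a list of human-readable problems found in a kinship matrix."""
    problems = []
    if K.shape[0] != K.shape[1]:
        problems.append(f"matrix is not square: {K.shape}")
        return problems
    # Row and column labels should list the same samples in the same order.
    if list(map(str, K.index)) != list(map(str, K.columns)):
        problems.append("row IDs and column IDs differ")
    values = K.to_numpy(dtype=float)
    if not np.isfinite(values).all():
        problems.append("matrix contains NaN or infinite values")
    elif not np.allclose(values, values.T):
        problems.append("matrix is not symmetric")
    return problems


# Toy 2x2 matrix standing in for a real kinship file (assumed layout).
csv_text = ",A,B\nA,1.0,0.2\nB,0.2,1.0\n"
K = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(check_kinship(K))
```

An empty list means none of these basic problems were found; it says nothing about whether the sample IDs match the genotype and phenotype files.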
Thank you Markus! I would really appreciate any help with this as long as it doesn't take too much of your time. I've uploaded a subset of my data here and can confirm that it does indeed produce the same error message. I thought that having a sparse kinship matrix might be the cause, seeing that gwas_sample_data/K_sample.csv does not contain any zeroes, but adding a small value to every matrix element did not solve the issue. My kinship matrix was produced using the PC-Relate method from the GENESIS package.
I am running GWAS-Flow as a CWL tool in a Dockerized environment built from your Dockerfile with just the entrypoint removed. The command line:
python /home/ubuntu/gwas.py --genotype subset/subset.plink --phenotype nih_pheno_severe.csv --kinship kinship.csv --perm 100
The problem does not seem to be the kinship matrix but rather the matching of the genotype IDs to the phenotype IDs. When I run GWAS_Flow with your data I get these warnings (do you have the latest master of GWAS_Flow?):
python gwas.py --genotype gwas_flow_example_data/subset/subset.plink --phenotype gwas_flow_example_data/nih_pheno_severe.csv --kinship gwas_flow_example_data/kinship.csv --perm 100
parsed commandline arguments
Mapping files: 100%|█████████████████████████████████████████████| 3/3 [00:00<00:00, 4.91it/s]
WARNING: accessions in X do not overlap with accessions in Y
Accessions X:
['0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0'
'0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0'
'0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0'
'0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0'
'0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0'
'0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0'
'0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0'
'0']
Accessions Y:
['COV100' 'COV101' 'COV102' 'COV103' 'COV104' 'COV105' 'COV106' 'COV108'
'COV109' 'COV110' 'COV111' 'COV112' 'COV113' 'COV114' 'COV115' 'COV117'
'COV118' 'COV119' 'COV121' 'COV122' 'COV124' 'COV125' 'COV126' 'COV127'
'COV128' 'COV140' 'COV141' 'COV142' 'COV143' 'COV147' 'COV148' 'COV149'
'COV153' 'COV154' 'COV155' 'COV156' 'COV157' 'COV158' 'COV160' 'COV161'
'COV162' 'COV165' 'COV166' 'COV167' 'COV168' 'COV169' 'COV170' 'COV171'
'COV172' 'COV173' 'COV174' 'COV175' 'COV176' 'COV178' 'COV179' 'COV180'
'COV182' 'COV183' 'COV184' 'COV185' 'COV186' 'COV187' 'COV188' 'COV189'
'COV190' 'COV191' 'COV192' 'COV193' 'COV194' 'COV196' 'COV197' 'COV198'
'COV199' 'COV201' 'COV202' 'COV204' 'COV205' 'COV206' 'COV207' 'COV49'
'COV50' 'COV51' 'COV52' 'COV55' 'COV56' 'COV57' 'COV58' 'COV59' 'COV61'
'COV62' 'COV64' 'COV66' 'COV67' 'COV68' 'COV69' 'COV70' 'COV71' 'COV72'
'COV73' 'COV74' 'COV75' 'COV76' 'COV77' 'COV78' 'COV79' 'COV80' 'COV81'
'COV82' 'COV83' 'COV85' 'COV86' 'COV87' 'COV88' 'COV89' 'COV90' 'COV91'
'COV92' 'COV93' 'COV94' 'COV95' 'COV96' 'COV97' 'COV98' 'COV99' 'K236'
'K8' 'K89']
WARNING: not all accessions are in the kinship matrix
Accessions X/Y:
[]
Accessions K:
Index(['COV100', 'COV101', 'COV102', 'COV103', 'COV108', 'COV113', 'COV114',
'COV115', 'COV117', 'COV118',
...
'COV51', 'COV61', 'COV62', 'COV66', 'COV71', 'COV73', 'COV87', 'COV92',
'COV95', 'COV96'],
dtype='object', length=127)
data has been imported
So because X and Y have no overlapping accessions, their intersection is already empty and cannot be matched with the kinship matrix either. This problem can be solved by putting the IDs from the second column of subset.fam into the first column as well (currently the first column contains only 0s).
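A rough sketch of that fix with pandas is shown below. It relies on the standard six-column PLINK .fam layout (family ID, individual ID, father, mother, sex, phenotype); the helper names and the toy rows are made up for illustration, and that GWAS_Flow reads the first column as the accession is only inferred from the warning output above.

```python
import io

import pandas as pd

FAM_COLUMNS = ["fid", "iid", "father", "mother", "sex", "pheno"]


def read_fam(source):
    """Read a whitespace-separated PLINK .fam file, keeping every field as text."""
    return pd.read_csv(source, sep=r"\s+", header=None,
                       names=FAM_COLUMNS, dtype=str)


def copy_iid_to_fid(fam):
    """Return a copy in which the family ID column repeats the individual ID."""
    fixed = fam.copy()
    fixed["fid"] = fixed["iid"]
    return fixed


# Toy .fam content with all-zero family IDs, as in the report.
fam_text = "0 COV100 0 0 0 -9\n0 COV101 0 0 0 -9\n"
pheno_ids = {"COV100", "COV101", "K8"}  # illustrative phenotype IDs

fam = read_fam(io.StringIO(fam_text))
print(set(fam["fid"]) & pheno_ids)      # no overlap before the fix

fixed = copy_iid_to_fid(fam)
print(sorted(set(fixed["fid"]) & pheno_ids))

# To write the result back (path is an example):
# fixed.to_csv("subset.fam", sep=" ", header=False, index=False)
```

Only the .fam file changes here; the .bed and .bim files do not contain sample IDs, so they can stay as they are.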
Unfortunately, this leads to another error which I'm still struggling to debug. For some reason there happen to be a different number of markers in the input and the output. I'll continue trying to find the problem tomorrow.
Thank you Markus, the FIDs were indeed empty, and I missed this warning because I had mistakenly saved only stderr in my logs.
Is this the same error that you are now getting?
2022-02-17T11:32:14.609129994Z Traceback (most recent call last):
2022-02-17T11:32:14.609142416Z File "/home/ubuntu/gwas.py", line 115, in <module>
2022-02-17T11:32:14.609146718Z 'eff_size': output[:, 1]})
2022-02-17T11:32:14.609150026Z File "/home/ubuntu/miniconda3/lib/python3.7/site-packages/pandas/core/frame.py", line 435, in __init__
2022-02-17T11:32:14.609153458Z mgr = init_dict(data, index, columns, dtype=dtype)
2022-02-17T11:32:14.609156839Z File "/home/ubuntu/miniconda3/lib/python3.7/site-packages/pandas/core/internals/construction.py", line 254, in init_dict
2022-02-17T11:32:14.609160209Z return arrays_to_mgr(arrays, data_names, index, columns, dtype=dtype)
2022-02-17T11:32:14.609163477Z File "/home/ubuntu/miniconda3/lib/python3.7/site-packages/pandas/core/internals/construction.py", line 64, in arrays_to_mgr
2022-02-17T11:32:14.609166887Z index = extract_index(arrays)
2022-02-17T11:32:14.609170040Z File "/home/ubuntu/miniconda3/lib/python3.7/site-packages/pandas/core/internals/construction.py", line 365, in extract_index
2022-02-17T11:32:14.609173475Z raise ValueError("arrays must all be same length")
2022-02-17T11:32:14.609176777Z ValueError: arrays must all be same length
It is :slightly_smiling_face: and I found the cause of the problem. One marker (X:68204280:C:T) is included twice in the subset.bim file (lines 472455 and 472456). As the marker is used as a (presumably) unique identifier, the duplicate caused a mismatch in the number of rows of the result table. After removing the duplicate line it works for me :smiley:
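Duplicates like this can be found before running the analysis. The sketch below assumes the standard six-column PLINK .bim layout (chromosome, variant ID, genetic distance, base-pair position, allele 1, allele 2); `find_duplicate_markers` is a hypothetical helper and the rows are toy data, with only the marker name taken from this thread.

```python
import io

import pandas as pd

BIM_COLUMNS = ["chrom", "marker", "cm", "pos", "a1", "a2"]


def find_duplicate_markers(source):
    """Return every .bim row whose marker ID occurs more than once."""
    bim = pd.read_csv(source, sep=r"\s+", header=None,
                      names=BIM_COLUMNS, dtype=str)
    bim.index = bim.index + 1  # use 1-based line numbers as the index
    return bim[bim["marker"].duplicated(keep=False)]


# Toy .bim content: the second and third lines share one marker ID.
bim_text = (
    "X\tX:100:A:G\t0\t100\tG\tA\n"
    "X\tX:68204280:C:T\t0\t68204280\tT\tC\n"
    "X\tX:68204280:C:T\t0\t68204280\tT\tC\n"
)

dups = find_duplicate_markers(io.StringIO(bim_text))
print(dups[["marker"]])
```

Because the .bed file stores genotypes in the same variant order as the .bim file, it may be safer to drop a duplicated variant with a tool that rewrites all three PLINK files together rather than by editing the .bim file alone.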
Thanks again and sorry - the entire thing ended up being due to faulty input :/
No worries! Happy to help!
And honestly, the sanity checks and error messages of GWAS_Flow could be better.
Hi!
I am trying to reformat my data to fit GWAS_Flow input format requirements. If possible, it would be very helpful if these input requirements could be documented in a little bit more detail.
I have managed to run GWAS_Flow with CSV input containing a subset of variants.
If however the genotype data is provided in HDF5 format - it seems that the sample IDs need to be integers, like in the example data. It didn’t immediately work with my data:
…but I can get past this if I remove the “COV” prefix from sample names. I guess this could be a standard or something that some VCF/Plink->HDF5 (if it exists) conversion tool enforces. Any info on how you create these files would be appreciated! I was using the rhdf5 R library to create mine, while trying to mirror the structure of your example files. But it seems there is always another piece missing.
I have also noticed that I cannot “mix and match” - if the genotype data is provided as an HDF5 file, the kinship matrix needs to be as well. Out of curiosity - what was used to construct the example kinship matrix? I am not used to seeing such high coefficients everywhere, but I am no expert. EDIT: Sorry, I was looking at /test_data/K_test.csv instead of /gwas_sample_data/K_sample.csv.
Finally, with the Plink format for the genotype (and CSVs for phenotype and kinship), I get this error:
Does this look familiar? Thank you!