Joyvalley / GWAS_Flow

GPU accelerated GWAS framework based on TensorFlow
MIT License

Input file format requirements #33

Closed · markozecevic closed this issue 2 years ago

markozecevic commented 2 years ago

Hi!

I am trying to reformat my data to fit the GWAS_Flow input format requirements. If possible, it would be very helpful if these input requirements were documented in a bit more detail.

I have managed to run GWAS_Flow with CSV input containing a subset of variants.

If, however, the genotype data is provided in HDF5 format, it seems that the sample IDs need to be integers, as in the example data. It did not immediately work with my data:

2022-01-12T17:51:30.473267359Z Traceback (most recent call last):
2022-01-12T17:51:30.473280368Z   File "/home/ubuntu/gwas.py", line 86, in <module>
2022-01-12T17:51:30.473295853Z     X_FILE, Y_FILE, K_FILE, M_PHE, COF_FILE)
2022-01-12T17:51:30.473299820Z   File "/home/ubuntu/gwas_flow/main.py", line 29, in load_and_prepare_data
2022-01-12T17:51:30.473304968Z     acc_x = np.asarray(snp['accessions'][:], dtype=np.int)
2022-01-12T17:51:30.473308795Z   File "/home/ubuntu/miniconda3/lib/python3.7/site-packages/numpy/core/_asarray.py", line 102, in asarray
2022-01-12T17:51:30.473312730Z     return array(a, dtype, copy=False, order=order)
2022-01-12T17:51:30.473317253Z ValueError: invalid literal for int() with base 10: b'COV49'

…but I can get past this by removing the "COV" prefix from the sample names. I guess integer IDs could be a standard, or something that a VCF/PLINK-to-HDF5 conversion tool (if one exists) enforces. Any info on how you create these files would be appreciated! I was using the rhdf5 R library to create mine, while trying to mirror the structure of your example files, but it seems there is always another piece missing.
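For reference, a minimal sketch of the coercion that fails in the traceback above, and of writing the accession IDs as plain integers so that it succeeds. Only the 'accessions' dataset name is taken from the traceback; the rest of the HDF5 layout GWAS_Flow expects (genotype matrix, positions, group names) is not reproduced here and would still need to mirror the example file:

```python
# Sketch only: the 'accessions' dataset name comes from the traceback above;
# everything else about the HDF5 layout is assumed / omitted.
import h5py
import numpy as np

sample_ids = [b"COV49", b"COV50", b"COV51"]

with h5py.File("geno_demo.h5", "w") as f:
    # Storing the raw byte strings makes np.asarray(..., dtype=int) fail with
    #   ValueError: invalid literal for int() with base 10: b'COV49'
    # Stripping the "COV" prefix and storing integers avoids the error.
    ints = [int(s.decode()[3:]) for s in sample_ids]
    f.create_dataset("accessions", data=np.asarray(ints))

with h5py.File("geno_demo.h5", "r") as snp:
    acc_x = np.asarray(snp["accessions"][:], dtype=int)  # works with integer IDs
    print(acc_x)  # [49 50 51]
```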

I have also noticed that I cannot "mix and match": if the genotype data is provided as HDF5, the kinship matrix needs to be as well. Out of curiosity, what was used to construct the example kinship matrix? I am not used to seeing such high coefficients everywhere, but I am no expert. EDIT: Sorry, I was looking at /test_data/K_test.csv instead of /gwas_sample_data/K_sample.csv.

Finally, with the PLINK format for the genotype (and CSVs for the phenotype and kinship), I get this error:

Mapping files:   0%|          | 0/3 [00:00<?, ?it/s]
Mapping files:  33%|███▎      | 1/3 [00:55<01:50, 55.27s/it]
Mapping files:  67%|██████▋   | 2/3 [00:57<00:24, 24.16s/it]
Mapping files: 100%|██████████| 3/3 [01:16<00:00, 21.84s/it]
Mapping files: 100%|██████████| 3/3 [01:16<00:00, 25.58s/it]
/home/ubuntu/gwas_flow/main.py:195: RuntimeWarning: divide by zero encountered in double_scalars
  * kin_vr) * kin_vr
Traceback (most recent call last):
  File "/home/ubuntu/gwas.py", line 94, in <module>
    output = main.gwas(X, K, Y_, BATCH_SIZE, COF)
  File "/home/ubuntu/gwas_flow/main.py", line 281, in gwas
    v_g, delta, v_e = get_herit(y_phe, k_stand)
  File "/home/ubuntu/gwas_flow/main.py", line 200, in get_herit
    return estimate_variance_components(y_phe, "normal", k_stand, verbose=False)
  File "/home/ubuntu/gwas_flow/herit.py", line 24, in estimate_variance_components
    q_s = economic_qs(kin)
  File "/home/ubuntu/miniconda3/lib/python3.7/site-packages/numpy_sugar/linalg/qs.py", line 24, in economic_qs
    nok = abs(max(Q[0].min(), Q[0].max(), key=abs)) < epsilon
IndexError: index 0 is out of bounds for axis 0 with size 0

Does this look familiar? Thank you!

iimog commented 2 years ago

The specific error happens when trying to standardize the kinship matrix, so it seems there is a problem with that. I can't tell whether the problem lies in the values themselves or in loading the matrix in the first place. Regarding the input format, I fully agree: it is not documented well enough, and we rely on some particularities that might not even be standard. GWAS_Flow will be superseded by a new program soon, and we will provide better standards compliance and documentation there. I understand that this is not a satisfactory solution for your current problem, so if you can't wait for the replacement I can try to help you get GWAS_Flow to work with your data. In that case: can you provide example files which look like your data and produce the same error message?

markozecevic commented 2 years ago

Thank you Markus! I would really appreciate any help with this, as long as it doesn't take too much of your time. I've uploaded a subset of my data here and can confirm that it does indeed produce the same error message. I thought that maybe having a sparse kinship matrix could be the cause, seeing that gwas_sample_data/K_sample.csv does not contain any zeroes, but adding a small value to every matrix element did not solve the issue. My kinship matrix was produced using the PC-Relate method from the GENESIS package.
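In case the format matters: a rough sketch of how a kinship CSV like mine could be written; whether GWAS_Flow really wants the sample IDs as both header and row index is only a guess based on gwas_sample_data/K_sample.csv:

```python
# Rough sketch of exporting a kinship matrix to CSV; the IDs-as-header/index
# layout is a guess from the example file, and the values here are hypothetical.
import numpy as np
import pandas as pd

ids = ["COV49", "COV50", "COV51"]           # hypothetical sample IDs
kin = np.array([[0.52, 0.01, 0.03],
                [0.01, 0.49, 0.02],
                [0.03, 0.02, 0.55]])        # hypothetical PC-Relate kinship values
pd.DataFrame(kin, index=ids, columns=ids).to_csv("kinship.csv")
```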

I am running GWAS_Flow as a CWL tool in a Dockerized environment built from your Dockerfile with just the entrypoint removed. The command line:

python /home/ubuntu/gwas.py --genotype subset/subset.plink --phenotype nih_pheno_severe.csv --kinship kinship.csv --perm 100

iimog commented 2 years ago

The problem does not seem to be the kinship matrix but rather the matching of genotype IDs to phenotype IDs. When I run with your data I get these warnings (do you have the latest master of GWAS_Flow?):

python gwas.py --genotype gwas_flow_example_data/subset/subset.plink --phenotype gwas_flow_example_data/nih_pheno_severe.csv --kinship gwas_flow_example_data/kinship.csv --perm 100
parsed commandline arguments
Mapping files: 100%|█████████████████████████████████████████████| 3/3 [00:00<00:00,  4.91it/s]
WARNING: accessions in X do not overlap with accessions in Y
Accessions X:
['0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0'
 '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0'
 '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0'
 '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0'
 '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0'
 '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0'
 '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0' '0'
 '0']
Accessions Y:
['COV100' 'COV101' 'COV102' 'COV103' 'COV104' 'COV105' 'COV106' 'COV108'
 'COV109' 'COV110' 'COV111' 'COV112' 'COV113' 'COV114' 'COV115' 'COV117'
 'COV118' 'COV119' 'COV121' 'COV122' 'COV124' 'COV125' 'COV126' 'COV127'
 'COV128' 'COV140' 'COV141' 'COV142' 'COV143' 'COV147' 'COV148' 'COV149'
 'COV153' 'COV154' 'COV155' 'COV156' 'COV157' 'COV158' 'COV160' 'COV161'
 'COV162' 'COV165' 'COV166' 'COV167' 'COV168' 'COV169' 'COV170' 'COV171'
 'COV172' 'COV173' 'COV174' 'COV175' 'COV176' 'COV178' 'COV179' 'COV180'
 'COV182' 'COV183' 'COV184' 'COV185' 'COV186' 'COV187' 'COV188' 'COV189'
 'COV190' 'COV191' 'COV192' 'COV193' 'COV194' 'COV196' 'COV197' 'COV198'
 'COV199' 'COV201' 'COV202' 'COV204' 'COV205' 'COV206' 'COV207' 'COV49'
 'COV50' 'COV51' 'COV52' 'COV55' 'COV56' 'COV57' 'COV58' 'COV59' 'COV61'
 'COV62' 'COV64' 'COV66' 'COV67' 'COV68' 'COV69' 'COV70' 'COV71' 'COV72'
 'COV73' 'COV74' 'COV75' 'COV76' 'COV77' 'COV78' 'COV79' 'COV80' 'COV81'
 'COV82' 'COV83' 'COV85' 'COV86' 'COV87' 'COV88' 'COV89' 'COV90' 'COV91'
 'COV92' 'COV93' 'COV94' 'COV95' 'COV96' 'COV97' 'COV98' 'COV99' 'K236'
 'K8' 'K89']
WARNING: not all accessions are in the kinship matrix
Accessions X/Y:
[]
Accessions K:
Index(['COV100', 'COV101', 'COV102', 'COV103', 'COV108', 'COV113', 'COV114',
       'COV115', 'COV117', 'COV118',
       ...
       'COV51', 'COV61', 'COV62', 'COV66', 'COV71', 'COV73', 'COV87', 'COV92',
       'COV95', 'COV96'],
      dtype='object', length=127)
data has been imported

So, because X and Y have no overlapping accessions, this set is already empty and cannot be matched against the kinship matrix either. The problem can be solved by putting the IDs from the second column of subset.fam (the sample IDs) into the first column as well, which currently contains only 0s; see the sketch below.
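A minimal sketch of that fix, assuming subset.fam has the usual six whitespace-separated PLINK columns (FID, IID, father, mother, sex, phenotype):

```python
# Copy the sample IDs (IID, column 2) into the family ID column (FID, column 1).
# Assumes the usual six whitespace-separated PLINK .fam columns.
import pandas as pd

fam = pd.read_csv("subset.fam", sep=r"\s+", header=None,
                  names=["FID", "IID", "PAT", "MAT", "SEX", "PHENO"])
fam["FID"] = fam["IID"]
fam.to_csv("subset.fam", sep=" ", header=False, index=False)
```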

Unfortunately, this leads to another error which I am still struggling to debug. For some reason the input and the output end up containing different numbers of markers. I'll continue trying to find the problem tomorrow.

markozecevic commented 2 years ago

Thank you Markus! The FIDs were indeed empty, and I missed this warning because by mistake I had only saved stderr in my logs.

Is this the same error that you are now getting?

2022-02-17T11:32:14.609129994Z Traceback (most recent call last):
2022-02-17T11:32:14.609142416Z   File "/home/ubuntu/gwas.py", line 115, in <module>
2022-02-17T11:32:14.609146718Z     'eff_size': output[:, 1]})
2022-02-17T11:32:14.609150026Z   File "/home/ubuntu/miniconda3/lib/python3.7/site-packages/pandas/core/frame.py", line 435, in __init__
2022-02-17T11:32:14.609153458Z     mgr = init_dict(data, index, columns, dtype=dtype)
2022-02-17T11:32:14.609156839Z   File "/home/ubuntu/miniconda3/lib/python3.7/site-packages/pandas/core/internals/construction.py", line 254, in init_dict
2022-02-17T11:32:14.609160209Z     return arrays_to_mgr(arrays, data_names, index, columns, dtype=dtype)
2022-02-17T11:32:14.609163477Z   File "/home/ubuntu/miniconda3/lib/python3.7/site-packages/pandas/core/internals/construction.py", line 64, in arrays_to_mgr
2022-02-17T11:32:14.609166887Z     index = extract_index(arrays)
2022-02-17T11:32:14.609170040Z   File "/home/ubuntu/miniconda3/lib/python3.7/site-packages/pandas/core/internals/construction.py", line 365, in extract_index
2022-02-17T11:32:14.609173475Z     raise ValueError("arrays must all be same length")
2022-02-17T11:32:14.609176777Z ValueError: arrays must all be same length

iimog commented 2 years ago

It is :slightly_smiling_face: and I found the cause of the problem. One marker (X:68204280:C:T) is included twice in the subset.bim file (lines 472455 and 472456). Since the marker ID is used as a (presumably) unique identifier, this caused a mismatch in the number of rows of the result table. After removing the duplicate line it works for me :smiley: For anyone hitting the same thing, a small sketch for spotting such duplicates is below.
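A minimal sketch for listing duplicated marker IDs in a PLINK .bim file (the variant ID is the second column; the file name is the one from this thread, everything else is assumed):

```python
# List duplicated marker IDs in a PLINK .bim file (variant ID is column 2).
import pandas as pd

bim = pd.read_csv("subset.bim", sep=r"\s+", header=None,
                  names=["CHR", "ID", "CM", "POS", "A1", "A2"])
print(bim[bim.duplicated("ID", keep=False)])  # e.g. the two X:68204280:C:T rows
```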

markozecevic commented 2 years ago

Thanks again and sorry - the entire thing ended up being due to faulty input :/

iimog commented 2 years ago

No worries! Happy to help! And honestly the sanity checks and error messages of GWAS_Flow could be better.