EBISPOT / gwas-sumstats-tools

Apache License 2.0

Migrate format functions from sum-stats-formatter to gwas-sumstat-tools #29

Closed jiyue1214 closed 4 months ago

jiyue1214 commented 6 months ago

sum-stats-formatter is the old formatter, which can take different input files and format them into the ready-for-submission format for the GWAS Catalog. We want to migrate these functions from sum-stats-formatter into gwas-sumstats-tools. Both the modules and a CLI should be available.

jiyue1214 commented 5 months ago
  1. One change in the new formatter is the absence of split by left/right name. This is because petl does not support this split directly, and it was unclear how it would benefit users.
  2. Almost done; looking for some files to test.

ljwh2 commented 5 months ago

@eks-ebi and @earlEBI to look for complicated files to test with (multiple changes needed, including in same column)

jiyue1214 commented 5 months ago

In gwas-sumstats-tools, the gwas-ssf format command serves three functions:

  1. Generate a configuration file
  2. Apply the configuration file
  3. Batch submission

I tested one input file from Lizzy and it works well. I am now testing it on 7 other real datasets.

jiyue1214 commented 5 months ago

Working on testing gwas-ssf format on 9 studies (for AMP). Need to discuss with Laura how many studies would be suitable for testing the script.

jiyue1214 commented 5 months ago

I tested the formatter on data from 4 publications for AMP (10 studies) and on 5 studies from 5 publications from Lizzy. The current formatter can format the files, but may be missing some functions:

  1. Cannot handle complex delimiters; a common example is runs of multiple whitespace characters. This is because petl can only read a single-character separator. Solution: passing skipinitialspace=True handles multiple spaces.
  2. Convert invalid values: some cells may hold invalid values (such as -nan) and need to be converted to #NA to pass validation.
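
The workaround in item 1 can be sketched with Python's standard csv module alone. This is a minimal illustration, not the formatter's actual code: the function name read_space_delimited and the sample rows are made up for the example. petl.fromcsv is documented as forwarding extra keyword arguments to the csv module, so the same two options should carry over, but that is not verified here.

```python
import csv
import io


def read_space_delimited(text):
    """Parse rows whose fields are separated by one or more spaces.

    Hypothetical helper for illustration. The delimiter is still a single
    character (a space); skipinitialspace=True makes the reader ignore
    spaces at the start of a field, which collapses runs of spaces.
    """
    reader = csv.reader(io.StringIO(text), delimiter=" ", skipinitialspace=True)
    # Drop completely empty lines, which csv.reader yields as [].
    return [row for row in reader if row]


# Made-up sample with uneven spacing between columns.
sample = (
    "rsid  chromosome   base_pair_location p_value\n"
    "rs123   1  10583    0.05\n"
)
rows = read_space_delimited(sample)
for row in rows:
    print(row)
```

One consequence of this approach is that genuinely empty fields between two spaces cannot be told apart from padding, so it only suits files where missing values are written out explicitly.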

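The conversion described in item 2 above could look roughly like the sketch below. Only -nan and #NA come from the comment; the other tokens in INVALID_TOKENS and the function names are assumptions chosen for illustration, not the validator's real rules.

```python
# Assumed set of tokens treated as invalid. Only "-nan" is named in the
# comment above; the rest are illustrative guesses.
INVALID_TOKENS = {"", "nan", "-nan", "na", "n/a"}


def normalise_missing(value, missing="#NA"):
    """Return `missing` when the cell holds an invalid token, else the cell unchanged."""
    token = value.strip().lower()
    return missing if token in INVALID_TOKENS else value


def normalise_row(row):
    """Apply normalise_missing to every cell in a row of strings."""
    return [normalise_missing(cell) for cell in row]


cleaned = normalise_row(["rs123", "-nan", "0.05", ""])
print(cleaned)
```

Matching is case-insensitive and ignores surrounding whitespace, so " NaN " is treated the same as "-nan"; valid cells are returned untouched.
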
Another function that may be worth considering for integration:

  1. Filling in chromosome and position when the file only has rsIDs (we would need to know which dbSNP and reference the user used, so I would not suggest offering it as a function in the UI). Decision: do not include this feature.

jiyue1214 commented 4 months ago

Migration is finished and I improved the pipeline based on testing on the real data. This ticket can be closed.