Cleanup the management commands

1001genomes / AraGWAS

GWAS catalogue for Arabidopsis thaliana

https://aragwas.1001genomes.org

MIT License

Cleanup the management commands #128

Open timeu opened 7 years ago

timeu commented 7 years ago

Currently there are a lot of management commands:

compute_n_hits
generate_complete_csv
import_phenotypes
import_publication_links
import_sample_number
index_study
setup_es
submit_to_datacite

Some of them were workarounds to get the data in, and we should remove those. So far I think submit_to_datacite, setup_es, index_study and import_phenotypes are definitely required. I am not sure about the others.

The import_phenotypes command should have an option to update the phenotype information if it already exists.

mtog commented 7 years ago

The problem is that we don't have a single command to add a new study. How do we usually add them? I will merge the instructions in compute_n_hits, import_publication_links and import_sample_number into one command and remove generate_complete_csv.

timeu commented 7 years ago

So I see it as follows: we should have an import_phenotypes command that we can run by hand or as a cronjob. It goes to AraPheno, fetches the data and inserts new phenotypes. I wouldn't want to update the existing ones, because otherwise we would need to re-index all the associations. Also, the data on AraPheno usually doesn't get updated once it is published. This will make sure that we always have the published AraPheno phenotypes in AraGWAS as well. Eventually we should also have a cronjob that runs the GWAS pipeline for the new phenotypes (or, if a new genotype is released, for all the existing ones). But right now we will probably do this by hand. So, as you pointed out, we probably need an endpoint that takes an HDF5 file, creates a GWAS study that is connected to the phenotype, and indexes the associations.
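The insert-only behaviour described above could look roughly like the following. This is a minimal sketch in plain Python, not an actual Django management command: the function name `sync_phenotypes`, the dict that stands in for the phenotype table, and the record fields are all hypothetical, since neither the AraPheno API nor the AraGWAS models are shown in this thread.

```python
from typing import Dict, Iterable, List


def sync_phenotypes(
    existing: Dict[int, dict],
    fetched: Iterable[dict],
    update_existing: bool = False,
) -> List[int]:
    """Insert phenotypes that are not stored yet and return their ids.

    `existing` stands in for the local phenotype table (hypothetical).
    `fetched` stands in for records returned by AraPheno (hypothetical shape).
    Existing records are left untouched unless `update_existing` is set,
    because changing them would require re-indexing their associations.
    """
    inserted = []
    for record in fetched:
        pid = record["id"]
        if pid in existing:
            if update_existing:
                # Opt-in overwrite; not counted as a new insert.
                existing[pid] = record
            continue
        existing[pid] = record
        inserted.append(pid)
    return inserted


if __name__ == "__main__":
    # Made-up example data, only to show the call.
    local = {1: {"id": 1, "name": "phenotype A"}}
    remote = [
        {"id": 1, "name": "phenotype A (edited)"},
        {"id": 2, "name": "phenotype B"},
    ]
    print(sync_phenotypes(local, remote))
    print(local[1]["name"])
```

Keeping the update path behind an explicit flag would let the same command serve both the cronjob case (insert only) and an occasional manual refresh.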

mtog commented 7 years ago

OK, I will delete the other commands and create a new one for new studies (as proposed in #31). However, the whole current pipeline relies on the assumption that studies, phenotypes and HDF5 files always carry the same id. Can we keep this assumption for the future (i.e. will the file be named 289.hdf5)?

timeu commented 7 years ago

No, we can't. This is purely a coincidence, because we currently have a 1-1 mapping between phenotypes and GWAS studies (1 transformation, 1 method and 1 genotype). As soon as we introduce either a new method or a new genotype version, this no longer holds. I would design the command so that it takes the phenotype id, genotype id, method, transformation and an HDF5 file and creates a new GWAS study (the id should be assigned automatically).
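One possible shape for such a command's interface is sketched below with `argparse` from the standard library. Everything here is an assumption for illustration: the `Study` dataclass, the flag names, the in-memory counter that stands in for a database-assigned id, and the example values `lmm` and `none`. A real implementation would presumably be a Django management command backed by the AraGWAS models.

```python
import argparse
import itertools
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Study:
    # Hypothetical record; the real AraGWAS model is not shown in the thread.
    id: int
    phenotype_id: int
    genotype_id: int
    method: str
    transformation: str
    hdf5_path: str


# Stand-in for a database sequence: study ids are assigned automatically and
# are independent of the phenotype id and of the HDF5 file name.
_next_study_id = itertools.count(1)


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Create a GWAS study (sketch)")
    parser.add_argument("--phenotype-id", type=int, required=True)
    parser.add_argument("--genotype-id", type=int, required=True)
    parser.add_argument("--method", required=True)
    parser.add_argument("--transformation", required=True)
    parser.add_argument("hdf5_file", help="path to the HDF5 result file")
    return parser


def create_study(argv: Optional[List[str]] = None) -> Study:
    """Parse the arguments and build a study with a freshly assigned id."""
    args = build_parser().parse_args(argv)
    return Study(
        id=next(_next_study_id),
        phenotype_id=args.phenotype_id,
        genotype_id=args.genotype_id,
        method=args.method,
        transformation=args.transformation,
        hdf5_path=args.hdf5_file,
    )


if __name__ == "__main__":
    # Example invocation with made-up values.
    study = create_study([
        "--phenotype-id", "289",
        "--genotype-id", "1",
        "--method", "lmm",
        "--transformation", "none",
        "results.hdf5",
    ])
    print(study)
```

Because the study id comes from the counter rather than from the file name, the same phenotype can be attached to several studies that differ in method, transformation or genotype version.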