CINECA-project / wp4-federated-joint-cohort-analysis

Host eQTL dataset into virtual cohort at CSC #5

Open shukapoo opened 4 years ago

shukapoo commented 4 years ago

Virtual Cohort at CSC to be tentatively ready by 27.04.2020

shukapoo commented 4 years ago

The Virtual Cohort at CSC is now ready to host data (up to 400 GB). We are waiting to hear exactly which datasets should be hosted; @kauralasoo has been asked to suggest some.

kauralasoo commented 4 years ago

Thanks Shubham,

This is great! What type of data can you host in there? Would you be able to host both RNA sequencing (fastq) and genotype (vcf) data? I think one good option would be to host the Finnish subset of the GEUVADIS dataset together with the corresponding 1000 Genomes genotypes. I can provide you with the relevant ENA sample ids and also subset the genotype dataset.

Melanie Courtot from WP3 was also interested in using some data from the HipSci project, because it has really good metadata and some of the samples are also open access. I will coordinate with her to see what would be the best way forward.

Best, Kaur

shukapoo commented 4 years ago

Hi @kauralasoo , I just got confirmation from our FEGA data team: both fastq & vcf files should be supported. The only practical constraint is the available space, which is limited to 400 GB for our use.

Regards, Shubham

kauralasoo commented 4 years ago

Hi @shukapoo , I added a .tsv file with eQTL sample metadata to this repository (https://github.com/CINECA-project/wp4-federated-joint-cohort-analysis/blob/master/virtual_cohort/GEUVADIS_YRI_samples.tsv). The sample_id column contains the ENA sample ids, which you can use to download the fastq files from ENA and add them to the virtual cohort. The ENA project is here: https://www.ebi.ac.uk/ena/data/view/PRJEB3366 These files total 362 GB, so they fit just below the 400 GB limit.

I will prepare the genotype data (vcf file) separately and let you know when it's ready. It will only be a few gigabytes.
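A minimal sketch of how the sample ids in the .tsv could be matched to fastq files, assuming the ENA Portal API file report is used to list the runs of PRJEB3366. The helper names (`filereport_url`, `read_sample_ids`, `select_fastq`) are hypothetical, and which report column the `sample_id` values correspond to is an assumption: it may be `sample_accession` or `secondary_sample_accession`, depending on the id format in the .tsv.

```python
import csv
import io
from urllib.parse import urlencode

ENA_FILEREPORT = "https://www.ebi.ac.uk/ena/portal/api/filereport"
LIMIT_BYTES = 400 * 10**9  # 400 GB quota mentioned above (decimal GB assumed)


def filereport_url(project="PRJEB3366"):
    """Build the file report URL listing every run in the ENA project."""
    params = {
        "accession": project,
        "result": "read_run",
        "fields": "run_accession,sample_accession,"
                  "secondary_sample_accession,fastq_ftp,fastq_bytes",
        "format": "tsv",
    }
    return f"{ENA_FILEREPORT}?{urlencode(params)}"


def read_sample_ids(tsv_text, column="sample_id"):
    """Collect the non-empty values of the sample id column from the metadata .tsv."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {row[column] for row in reader if row[column]}


def select_fastq(report_text, sample_ids, match_field="sample_accession"):
    """Return (fastq URLs, total bytes) for runs whose sample is in sample_ids."""
    urls, total = [], 0
    reader = csv.DictReader(io.StringIO(report_text), delimiter="\t")
    for row in reader:
        if row[match_field] not in sample_ids:
            continue
        # Paired-end runs list both files separated by ';' in each field.
        urls.extend(u for u in row["fastq_ftp"].split(";") if u)
        total += sum(int(b) for b in row["fastq_bytes"].split(";") if b)
    return urls, total
```

The text at `filereport_url()` can be fetched with any HTTP client and passed to `select_fastq` together with the ids from the .tsv. Comparing the returned total with `LIMIT_BYTES` shows whether the selection fits within the quota before anything is downloaded.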

kauralasoo commented 4 years ago

Hi @shukapoo @lvarin , can you update me on the status of the virtual cohort? Are these fastq files accessible on the kubernetes cluster that is running TESK?

lvarin commented 4 years ago

Hello,

I will have to consult @shukapoo and my other colleagues to be sure, but I do not think they are available. I will see if I can upload these files tomorrow.

Regards

kauralasoo commented 4 years ago

Thanks,

No need to upload them ad hoc. I thought that if they are already in the local EGA virtual cohort, then I could test whether I can access them that way.

Kaur

lvarin commented 4 years ago

Hello,

I just got the information that the local EGA does not support metadata, so it is not possible.

Regards

kauralasoo commented 4 years ago

Just to clarify, does this mean that it's not possible to access the data from the local EGA at all, or does it mean that the data has to be copied manually to the kubernetes cluster?