EBISPOT / goci

GWAS Catalog Ontology and Curation Infrastructure
Apache License 2.0

Some studies without harmonised files in 36635386 #1420

Closed eks-ebi closed 2 weeks ago

eks-ebi commented 2 months ago

This issue was raised by a user query.

Some studies in this publication have harmonised files, e.g. http://ftp.ebi.ac.uk/pub/databases/gwas/summary_statistics/GCST90200001-GCST90201000/GCST90200266/

But others do not, e.g. http://ftp.ebi.ac.uk/pub/databases/gwas/summary_statistics/GCST90200001-GCST90201000/GCST90200267/

I can't see any obvious reason why one would be harmonised and the other not, so maybe there was an error in the harmonisation process?
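For anyone wanting to check more accessions from this publication, here is a minimal sketch of how the FTP directory URL could be built from a GCST accession. The function name `summary_stats_url` is hypothetical, and the assumption that directories are grouped in blocks of 1000 accessions is inferred only from the two example URLs above.

```python
def summary_stats_url(accession: str, bucket_size: int = 1000) -> str:
    """Build the summary statistics directory URL for a GCST accession.

    Assumption: accessions are grouped into range directories of
    `bucket_size` entries (e.g. GCST90200001-GCST90201000), as the two
    example URLs suggest. This layout is inferred, not confirmed.
    """
    base = "http://ftp.ebi.ac.uk/pub/databases/gwas/summary_statistics"
    prefix, digits = accession[:4], accession[4:]
    if prefix != "GCST" or not digits.isdigit():
        raise ValueError(f"not a GCST accession: {accession!r}")
    number = int(digits)
    # First and last accession number of the block containing `number`.
    start = (number - 1) // bucket_size * bucket_size + 1
    end = start + bucket_size - 1
    width = len(digits)  # keep the same zero-padded width as the input
    return (
        f"{base}/{prefix}{start:0{width}d}-{prefix}{end:0{width}d}/{accession}/"
    )


print(summary_stats_url("GCST90200266"))
print(summary_stats_url("GCST90200267"))
```

Both calls reproduce the two URLs quoted above, so the directory contents can be compared side by side.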


jiyue1214 commented 1 month ago

Hello, it is true that some studies failed in the harmonization pipeline and the results did not show up on the FTP. However, this file is not such a case.

There is another scenario: our pipeline runs 1600 studies per day using 4 jobs (400 studies per job). This particular publication is a large paper containing over 4000 studies, which were divided into at least 10 jobs to run. The reason you see some studies harmonized while others are not is that one of those jobs may have been interrupted and did not succeed.
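To illustrate the batching described above, here is a rough sketch rather than the actual pipeline code. The function `split_into_jobs`, the placeholder study IDs, and the index of the interrupted job are all hypothetical. The figures of 400 studies per job and 4 jobs per day come from this comment, and 4443 is the total reported for this publication later in the thread.

```python
from math import ceil


def split_into_jobs(study_ids, studies_per_job=400):
    """Split a list of study IDs into consecutive batches, one per job."""
    return [
        study_ids[start:start + studies_per_job]
        for start in range(0, len(study_ids), studies_per_job)
    ]


# Placeholder IDs standing in for the 4443 studies of the publication.
study_ids = [f"study_{n:04d}" for n in range(1, 4444)]
jobs = split_into_jobs(study_ids)
days_needed = ceil(len(jobs) / 4)  # 4 jobs per day

# If one job is interrupted, only the studies in that batch are missing,
# while studies in every other batch are harmonised as normal.
interrupted = 7  # hypothetical index of the interrupted job
missing = set(jobs[interrupted])
harmonised = [s for s in study_ids if s not in missing]

print(len(jobs), days_needed, len(missing), len(harmonised))
```

This is why two neighbouring accessions can end up in different states: they may simply fall on either side of a batch boundary.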

I have a list of studies with files that need to be fixed and cannot finish the pipeline. The study you listed is not on my list.

What I am going to do: requeue them in the harmonisation pipeline.

jiyue1214 commented 1 month ago

I have re-queued the studies that failed harmonisation without any apparent reason, so that they are re-harmonised.

jiyue1214 commented 1 month ago

For PMID_36635386 there are 4443 studies in total: 4140 were already harmonised and 303 had not been harmonised as of 24th Sep.

One study, GCST90200809, cannot be harmonised because of the error "column length (11) does not match header length (10)". GCST90202482 has some invalid cells and needs to be looked at in detail.

The other 301 failed because of a time limit issue in the qualitycontrolqc step. I will change the wall time for this step and rerun them.
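The column-length error reported for GCST90200809 usually means that at least one data row has more fields than the header. Below is a minimal sketch of a check for such rows. It is not the pipeline's own validation, and the function name `find_ragged_rows`, the tab delimiter, and the sample data are assumptions for illustration.

```python
import csv
import io


def find_ragged_rows(handle, delimiter="\t"):
    """Return (line_number, field_count) for rows that do not match the header.

    Assumes a delimited text file whose first line is the header.
    """
    reader = csv.reader(handle, delimiter=delimiter, quoting=csv.QUOTE_NONE)
    header = next(reader, None)
    if header is None:
        return []
    expected = len(header)
    problems = []
    for row in reader:
        if not row:  # skip completely empty lines
            continue
        if len(row) != expected:
            problems.append((reader.line_num, len(row)))
    return problems


# Made-up sample: the header has 3 columns, the second data row has 4.
sample = io.StringIO(
    "chromosome\tbase_pair_location\tp_value\n"
    "1\t1000\t0.5\n"
    "1\t2000\t0.1\textra\n"
)
print(find_ragged_rows(sample))
```

Running a check like this on the submitted file would point to the lines that need fixing before the study is re-queued.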

jiyue1214 commented 1 month ago

164/303 have already been successfully harmonised. There are still 139 studies waiting in the queue to be harmonised.

Started a job to run those 139 studies specifically. Job tag: pre_gwas_ssf_20240930_PMID_36635386_on_20240930-19-19

jiyue1214 commented 1 month ago

59 studies harmonised with no problem. One noticeable issue is that the log file movement started before the log generation had finished.

Example: GCST90200876.running.log was created at 2024-10-01 01:45:08.082020079 +0100; however, the job moving files to FTP started at 2024-10-01 01:44:36.

In theory this should not happen, because we specifically set the output of the previous step as the input of the movement step to prevent this situation. I will investigate further.
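To make the ordering problem concrete, here is a small sketch that compares the two timestamps from the example. The helper `parse_stat_timestamp` is hypothetical, and the code assumes the move start time is in the same +0100 timezone as the log timestamp, which the comment does not state.

```python
from datetime import datetime


def parse_stat_timestamp(text):
    """Parse a timestamp such as '2024-10-01 01:45:08.082020079 +0100'.

    Python's %f accepts at most 6 fractional digits, so nanoseconds are
    truncated to microseconds.
    """
    date_part, time_part, offset = text.split()
    whole, _, frac = time_part.partition(".")
    frac = frac[:6].ljust(6, "0")
    return datetime.strptime(
        f"{date_part} {whole}.{frac} {offset}", "%Y-%m-%d %H:%M:%S.%f %z"
    )


log_created = parse_stat_timestamp("2024-10-01 01:45:08.082020079 +0100")
# Assumption: the move start time uses the same timezone as the log file.
move_started = datetime(2024, 10, 1, 1, 44, 36, tzinfo=log_created.tzinfo)

gap = log_created - move_started
if gap.total_seconds() > 0:
    print(f"log was created {gap.total_seconds():.2f}s after the move started")
```

With the values above, the log file appears roughly 32 seconds after the move began, which matches the observation that the movement started too early.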

jiyue1214 commented 4 weeks ago

All studies of this publication are now harmonised, with a full set of harmonised data. PMID_36635386_harmonised_result_check.txt

This ticket can be closed.