Issues with trioPhaser pathing/directories causing issues on cloud compute platforms

jon4thin commented 2 months ago

I am trying to use the Docker image on a cloud-based platform to run the pipeline, but I am running into problems because I am not in an interactive environment where I can create directories. Even the example doesn't work here. I tried various ways of putting everything in the entry point, since the cloud platform handles mounting automatically (whatever that means; I have no experience with Docker):

python3 /trio_phaser.py -c /trioPhaser/validate/son_ashkenazim_GRCh38_chr22.g.vcf.gz -p /trioPhaser/validate/father_ashkenazim_GRCh38_chr22.g.vcf.gz -m /trioPhaser/validate/mother_ashkenazim_GRCh38_chr22.g.vcf.gz -o phased_output.vcf.gz -r /data > /data/trio_phaser.out

and this:

python3 /trio_phaser.py -c /trioPhaser/validate/son_ashkenazim_GRCh38_chr22.g.vcf.gz -p /trioPhaser/validate/father_ashkenazim_GRCh38_chr22.g.vcf.gz -m /trioPhaser/validate/mother_ashkenazim_GRCh38_chr22.g.vcf.gz -o phased_output.vcf.gz -r /data/ > /data/trio_phaser.out

and this:

python3 /trio_phaser.py -c /trioPhaser/validate/son_ashkenazim_GRCh38_chr22.g.vcf.gz -p /trioPhaser/validate/father_ashkenazim_GRCh38_chr22.g.vcf.gz -m /trioPhaser/validate/mother_ashkenazim_GRCh38_chr22.g.vcf.gz -o /data/phased_output.vcf.gz -r /data/ > /trio_phaser.out

and I get errors like:

[E::vcf_parse_format] Number of columns at 22:22325741 does not match the number of samples (1278 vs 2548)
index: "phased_output.vcf.gz" is in a format that cannot be usefully indexed
chmod: cannot access 'phased_output.vcf.gz.csi': No such file or directory

or, in the error log, I get this:

2024-08-21T04:07:39.243751659Z gzip: CG0003-6743.TrioPhaser.vcf: unexpected end of file
2024-08-21T04:07:40.019395911Z index: "CG0003-6743.TrioPhaser.vcf" is in a format that cannot be usefully indexed
2024-08-21T04:07:40.401640506Z chmod: cannot access 'CG0003-6743.TrioPhaser.vcf.csi': No such file or directory

but, at the same time, the standard out returned this:

Phased output file written as CG0003-6743.TrioPhaser.vcf
Outputfile written, compressed and indexed. Time elapsed: 0.25 minutes.

**********************************************************************
Done. Time elapsed: 209.41 minutes (3.49 hours) 
**********************************************************************
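
Since the standard out reports success while the error log reports a truncated file, it can help to check the output independently of the pipeline's own messages. The following is a minimal sketch, not part of trioPhaser: the helper name `check_compressed_output` is hypothetical, and it assumes the output is meant to be a gzip/bgzip-compressed file with a `.csi` or `.tbi` index next to it.

```python
import gzip
import os


def check_compressed_output(path):
    """Return a list of problems found with a gzip/bgzip-compressed output file."""
    if not os.path.isfile(path):
        return [f"{path}: file does not exist"]
    problems = []
    try:
        with gzip.open(path, "rb") as handle:
            # Read to the end in 1 MiB chunks; a truncated stream raises EOFError.
            while handle.read(1 << 20):
                pass
    except EOFError:
        problems.append(f"{path}: unexpected end of file (truncated)")
    except OSError as err:
        # Raised when the file is not gzip data at all (e.g. a plain-text VCF).
        problems.append(f"{path}: not readable as gzip ({err})")
    # Look for either index flavour next to the file (hypothetical naming assumption).
    if not any(os.path.isfile(path + ext) for ext in (".csi", ".tbi")):
        problems.append(f"{path}: no .csi or .tbi index found")
    return problems
```

An empty list means the file decompressed cleanly and an index was found; anything else is worth comparing against the error log.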

jon4thin commented 2 months ago

The pathing is definitely weird. Is there any way to get the program to work without creating new directories?

dmiller903 commented 2 months ago

Can you share your series of steps? I don’t see steps where you’re actually using Docker. The examples above imply to me that you’re using Docker in interactive mode, but you said you’re not using it interactively. It looks like whatever you’re doing is outside of Docker.

jon4thin commented 2 months ago

Yes, I am calling the Docker image on a cloud compute platform called BioDataCatalyst, which uses Velsera's Seven Bridges Platform running on AWS EC2. Essentially, the platform goes into the Docker image and mounts its own directory, which it sets as the home directory for your session, e.g. /sevenbridges/task_num_02/blar

The command the platform runs looks like this: first it starts the Docker image, and then in the virtual session it runs:

python3 /trio_phaser.py --output_file CG0011-6062.trioPhaser.vcf  --haplotype_reference_files /data --child_file /sbgenomics/Projects/24d7cac4-8be1-4639-83b0-58f4c8afe108/pcgc_gvcfs/pcgc04-NPCGC/CG0011-6062.g.vcf.gz --maternal_file /sbgenomics/Projects/24d7cac4-8be1-4639-83b0-58f4c8afe108/pcgc_gvcfs/pcgc04-NPCGC/CG0011-6078.g.vcf.gz --paternal_file /sbgenomics/Projects/24d7cac4-8be1-4639-83b0-58f4c8afe108/pcgc_gvcfs/pcgc04-NPCGC/CG0011-6618.g.vcf.gz --number_of_tasks 22 > /sbgenomics/workspaces/24d7cac4-8be1-4639-83b0-58f4c8afe108/tasks/d8de8bc7-9796-4511-8261-dbf1d0911ee8/triophaser/standard.out
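
Because the platform chooses the working directory, it may be useful to record where the task actually runs before invoking the pipeline. This is a small diagnostic sketch only; the function name `describe_working_directory` is hypothetical, and `/data` is used as the default simply because it is the reference directory passed in the commands in this thread.

```python
import os


def describe_working_directory(reference_dir="/data"):
    """Collect basic facts about the directory the task was started in."""
    cwd = os.getcwd()
    return {
        "cwd": cwd,                                      # where relative paths resolve
        "home": os.environ.get("HOME"),                  # home directory, if set
        "cwd_writable": os.access(cwd, os.W_OK),         # can outputs be written here?
        "reference_dir_exists": os.path.isdir(reference_dir),
    }


if __name__ == "__main__":
    for key, value in describe_working_directory().items():
        print(f"{key}: {value}")
```

Running this as the first step of the task and saving its output next to the standard out would show whether a relative output file name lands in the mounted task directory or somewhere else.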

This is a Google Drive link to a folder where I put the standard out and the error log files for this run. The instance configuration was: 1023.49 GB SSD · 36 vCPUs · 58.95 GB RAM; the metric graphs show no issues. I attached a screenshot below. The only odd thing is that for 2/3 of the run time it runs on 2 CPUs, then it spikes to 22 for a short time, and then slowly drops back down to 1 CPU until it ends.

The odd thing is that the first time I ran it with my real data, without trying to recover the outputs, it worked (the job.tree.log had the outputs and the error log was clean):

python3 /trio_phaser.py --output_file CG0003-6743.g.vcf.gz.phased.vcf --haplotype_reference_files /data --child_file /sbgenomics/Projects/24d7cac4-8be1-4639-83b0-58f4c8afe108/pcgc_gvcfs/Stragglers_20250529/240_Yale_Read_Discrepant/CG0003-6743.g.vcf.gz --maternal_file /sbgenomics/Projects/24d7cac4-8be1-4639-83b0-58f4c8afe108/pcgc_gvcfs/pcgc02-TOF/CG0003-6257.g.vcf.gz --paternal_file /sbgenomics/Projects/24d7cac4-8be1-4639-83b0-58f4c8afe108/pcgc_gvcfs/pcgc02-TOF/CG0003-6248.g.vcf.gz --number_of_tasks 22 > /sbgenomics/workspaces/24d7cac4-8be1-4639-83b0-58f4c8afe108/tasks/fdbb10d9-38e2-498b-ba1f-3bd88e78ab4c/triophaser/output.txt

But when I asked the cloud platform to search for and recover the outputs, it stopped working (the job.tree.log only had the .tbi, and the error log had the "unexpected end of file" error I described above):

python3 /trio_phaser.py --output_file CG0003-6743.phased.vcf --haplotype_reference_files /data --child_file /sbgenomics/Projects/24d7cac4-8be1-4639-83b0-58f4c8afe108/pcgc_gvcfs/Stragglers_20250529/240_Yale_Read_Discrepant/CG0003-6743.g.vcf.gz --maternal_file /sbgenomics/Projects/24d7cac4-8be1-4639-83b0-58f4c8afe108/pcgc_gvcfs/pcgc02-TOF/CG0003-6257.g.vcf.gz --paternal_file /sbgenomics/Projects/24d7cac4-8be1-4639-83b0-58f4c8afe108/pcgc_gvcfs/pcgc02-TOF/CG0003-6248.g.vcf.gz --number_of_tasks 22 > /sbgenomics/workspaces/24d7cac4-8be1-4639-83b0-58f4c8afe108/tasks/7236c44e-15e4-46d0-b026-80068b8b0ab3/triophaser/output.txt

We then tried another way to capture the outputs, and it worked when we downloaded the test VCFs stored in the package and used them as input:

python3 /trio_phaser.py --output_file son_ashkenazim_GRCh38_chr22.g.vcf.gz.phased.vcf --haplotype_reference_files /data --child_file /sbgenomics/Projects/24d7cac4-8be1-4639-83b0-58f4c8afe108/Test_TrioPhaser/son_ashkenazim_GRCh38_chr22.g.vcf.gz --maternal_file /sbgenomics/Projects/24d7cac4-8be1-4639-83b0-58f4c8afe108/Test_TrioPhaser/mother_ashkenazim_GRCh38_chr22.g.vcf.gz --paternal_file /sbgenomics/Projects/24d7cac4-8be1-4639-83b0-58f4c8afe108/Test_TrioPhaser/father_ashkenazim_GRCh38_chr22.g.vcf.gz --number_of_tasks 22 > /sbgenomics/workspaces/24d7cac4-8be1-4639-83b0-58f4c8afe108/tasks/26e61e19-f934-4713-ad8d-d5d614c8274e/triophaser/standard.out

But then this failed when I used my real data. As you can see, the actual commands are pretty much identical except for the file names, which leads me to believe there is some directory-related issue with the haplotype reference and maybe with where the outputs are being directed. I say this because, when we tried the same command but set -r to ./, the platform picked up the shapeit2 reference files and, once again, only the .tbi for the output file:

python3 /trio_phaser.py --child_file /sbgenomics/Projects/24d7cac4-8be1-4639-83b0-58f4c8afe108/Test_TrioPhaser/son_ashkenazim_GRCh38_chr22.g.vcf.gz --maternal_file /sbgenomics/Projects/24d7cac4-8be1-4639-83b0-58f4c8afe108/Test_TrioPhaser/mother_ashkenazim_GRCh38_chr22.g.vcf.gz --paternal_file /sbgenomics/Projects/24d7cac4-8be1-4639-83b0-58f4c8afe108/Test_TrioPhaser/father_ashkenazim_GRCh38_chr22.g.vcf.gz --haplotype_reference_files ./ --build_version 38 --number_of_tasks 10 --output_file ./son_ashkenazim_GRCh38_chr22.trioPhaser.vcf > /sbgenomics/workspaces/24d7cac4-8be1-4639-83b0-58f4c8afe108/tasks/e4710c4b-fff1-4bbd-a36c-09ee0fd3dc42/triophaser/strdout.txt

I am currently running more configurations to try to get it working, like setting the output to ~/

jon4thin commented 2 months ago

just fyi, I have run multiple developers' docker images on this platform and I have not had issues like this before.

jon4thin commented 2 months ago

OK, I figured out a consistent fix. I just need to prepend ./ to the output file name and it works (with -r set to /data, of course).
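
One plausible explanation, which nothing in this thread confirms, is that the output path gets split into a directory and a file name somewhere in the pipeline: a bare file name has an empty directory part, while ./name does not. The sketch below only demonstrates that difference using Python's standard library; the helper `explicit_output_path` is hypothetical and is not part of trioPhaser.

```python
import os


def explicit_output_path(output_file):
    """Prefix a bare file name with the current directory so its directory part is never empty."""
    if os.path.dirname(output_file) == "":
        return os.path.join(".", output_file)
    return output_file


# A bare file name has an empty directory component ...
print(repr(os.path.dirname("phased_output.vcf.gz")))
# ... while the same name with a leading ./ has "." as its directory.
print(repr(os.path.dirname("./phased_output.vcf.gz")))
# Paths that already name a directory are returned unchanged.
print(explicit_output_path("/data/phased_output.vcf.gz"))
```

A wrapper that applies this kind of normalization to the --output_file value before calling the pipeline would reproduce the workaround without having to remember the ./ prefix each time.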

jon4thin commented 2 months ago

but please see if there is a cleaner way for the directories to be set up or something!

dmiller903 / trioPhaser

Issues with trioPhaser pathing/directories causing issues on cloud compute platforms #3