Open jon4thin opened 2 months ago
The pathing is definitely weird. Is there any way to get the program to work without creating new directories?
Can you share your series of steps? I don’t see steps where you’re actually using docker. The examples above imply to me you’re using Docker in interactive mode but you said you’re not using interactively. It looks like whatever you’re doing is outside of Docker.
Yes, I am calling the docker image on a cloud compute platform called BioDataCatalyst, which uses Velsera's Seven Bridges Platform running on AWS EC2. Essentially, the platform goes into the docker image and mounts its own directory, which it sets as the home directory for your session, e.g. /sevenbridges/task_num_02/blar
the command the platform runs looks like this: first it starts the docker image, and then in the virtual session it runs:
python3 /trio_phaser.py --output_file CG0011-6062.trioPhaser.vcf --haplotype_reference_files /data --child_file /sbgenomics/Projects/24d7cac4-8be1-4639-83b0-58f4c8afe108/pcgc_gvcfs/pcgc04-NPCGC/CG0011-6062.g.vcf.gz --maternal_file /sbgenomics/Projects/24d7cac4-8be1-4639-83b0-58f4c8afe108/pcgc_gvcfs/pcgc04-NPCGC/CG0011-6078.g.vcf.gz --paternal_file /sbgenomics/Projects/24d7cac4-8be1-4639-83b0-58f4c8afe108/pcgc_gvcfs/pcgc04-NPCGC/CG0011-6618.g.vcf.gz --number_of_tasks 22 > /sbgenomics/workspaces/24d7cac4-8be1-4639-83b0-58f4c8afe108/tasks/d8de8bc7-9796-4511-8261-dbf1d0911ee8/triophaser/standard.out
This is a google drive link to a folder where I put the standard out and the error log files for this run. The instance configuration was: 1023.49 GB SSD · 36 vCPUs · 58.95 GB RAM; the metric graphs show no issues. I attached a screenshot below. The only odd thing is that for 2/3 of the run time it runs on 2 CPUs, then it spikes to 22 for a short amount of time, and then it slowly drops back down to 1 CPU until it ends.
The odd thing is that the first time I ran it with my real data, without trying to recover the outputs, it worked (the job.tree.log had the outputs and the error log was clean):
python3 /trio_phaser.py --output_file CG0003-6743.g.vcf.gz.phased.vcf --haplotype_reference_files /data --child_file /sbgenomics/Projects/24d7cac4-8be1-4639-83b0-58f4c8afe108/pcgc_gvcfs/Stragglers_20250529/240_Yale_Read_Discrepant/CG0003-6743.g.vcf.gz --maternal_file /sbgenomics/Projects/24d7cac4-8be1-4639-83b0-58f4c8afe108/pcgc_gvcfs/pcgc02-TOF/CG0003-6257.g.vcf.gz --paternal_file /sbgenomics/Projects/24d7cac4-8be1-4639-83b0-58f4c8afe108/pcgc_gvcfs/pcgc02-TOF/CG0003-6248.g.vcf.gz --number_of_tasks 22 > /sbgenomics/workspaces/24d7cac4-8be1-4639-83b0-58f4c8afe108/tasks/fdbb10d9-38e2-498b-ba1f-3bd88e78ab4c/triophaser/output.txt
but when I asked the cloud platform to search for and recover the outputs, it ceased to work (like the job.tree.log only had the .tbi and the error log had the "unexpected end of file" error I described above):
python3 /trio_phaser.py --output_file CG0003-6743.phased.vcf --haplotype_reference_files /data --child_file /sbgenomics/Projects/24d7cac4-8be1-4639-83b0-58f4c8afe108/pcgc_gvcfs/Stragglers_20250529/240_Yale_Read_Discrepant/CG0003-6743.g.vcf.gz --maternal_file /sbgenomics/Projects/24d7cac4-8be1-4639-83b0-58f4c8afe108/pcgc_gvcfs/pcgc02-TOF/CG0003-6257.g.vcf.gz --paternal_file /sbgenomics/Projects/24d7cac4-8be1-4639-83b0-58f4c8afe108/pcgc_gvcfs/pcgc02-TOF/CG0003-6248.g.vcf.gz --number_of_tasks 22 > /sbgenomics/workspaces/24d7cac4-8be1-4639-83b0-58f4c8afe108/tasks/7236c44e-15e4-46d0-b026-80068b8b0ab3/triophaser/output.txt
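One way to test whether an "unexpected end of file" message points to a truncated compressed file is to read the file through to its end. This is a minimal sketch, assuming the file in question is gzip- or bgzip-compressed (that is a guess, since the log is not reproduced here). gzip_reads_to_end is a hypothetical helper name and is not part of trio_phaser.

```python
import gzip
import zlib


def gzip_reads_to_end(path, chunk_size=1 << 20):
    """Return True if the gzip file at `path` decompresses cleanly to its end.

    Hypothetical helper for illustration; it is not part of trio_phaser.
    """
    try:
        with gzip.open(path, "rb") as handle:
            # Read in chunks so a large gVCF is never held in memory at once.
            while handle.read(chunk_size):
                pass
    except (EOFError, OSError, zlib.error):
        # EOFError: the stream stopped before its end-of-stream marker (truncation).
        # OSError / zlib.error: missing file, not a gzip file, or corrupted data.
        return False
    return True
```

Running it on the three input gVCFs and on any compressed output would at least show whether a file was cut short before or during the run.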
We then tried another way to capture the outputs and it worked when we downloaded and input the test vcfs stored in the package:
python3 /trio_phaser.py --output_file son_ashkenazim_GRCh38_chr22.g.vcf.gz.phased.vcf --haplotype_reference_files /data --child_file /sbgenomics/Projects/24d7cac4-8be1-4639-83b0-58f4c8afe108/Test_TrioPhaser/son_ashkenazim_GRCh38_chr22.g.vcf.gz --maternal_file /sbgenomics/Projects/24d7cac4-8be1-4639-83b0-58f4c8afe108/Test_TrioPhaser/mother_ashkenazim_GRCh38_chr22.g.vcf.gz --paternal_file /sbgenomics/Projects/24d7cac4-8be1-4639-83b0-58f4c8afe108/Test_TrioPhaser/father_ashkenazim_GRCh38_chr22.g.vcf.gz --number_of_tasks 22 > /sbgenomics/workspaces/24d7cac4-8be1-4639-83b0-58f4c8afe108/tasks/26e61e19-f934-4713-ad8d-d5d614c8274e/triophaser/standard.out
But then this failed when I used my real data. As you can see, the actual commands are pretty much identical except for the file names, which leads me to believe there is some weird directory-related issue associated with the haplotype reference and maybe with where the outputs are being directed. I say this because, when we tried the same command but set -r to ./, the platform picked up the shapeit2 reference files and, once again, only the .tbi for the output file:
python3 /trio_phaser.py --child_file /sbgenomics/Projects/24d7cac4-8be1-4639-83b0-58f4c8afe108/Test_TrioPhaser/son_ashkenazim_GRCh38_chr22.g.vcf.gz --maternal_file /sbgenomics/Projects/24d7cac4-8be1-4639-83b0-58f4c8afe108/Test_TrioPhaser/mother_ashkenazim_GRCh38_chr22.g.vcf.gz --paternal_file /sbgenomics/Projects/24d7cac4-8be1-4639-83b0-58f4c8afe108/Test_TrioPhaser/father_ashkenazim_GRCh38_chr22.g.vcf.gz --haplotype_reference_files ./ --build_version 38 --number_of_tasks 10 --output_file ./son_ashkenazim_GRCh38_chr22.trioPhaser.vcf > /sbgenomics/workspaces/24d7cac4-8be1-4639-83b0-58f4c8afe108/tasks/e4710c4b-fff1-4bbd-a36c-09ee0fd3dc42/triophaser/strdout.txt
I am currently running more configurations to try to get it working, like setting the output to ~/
just fyi, I have run multiple developers' docker images on this platform and I have not had issues like this before.
Ok, I figured out a consistent fix. I just need to prepend ./ to the output file name and it works (with -r set to /data, ofc)
but please see if there is a cleaner way for the directories to be set up or something!
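A possible reason the ./ prefix matters, offered only as a guess since the trio_phaser.py source is not shown in this thread: if a script derives a directory from the output path, a bare file name yields an empty string, while the ./ form yields a usable directory. The sketch below shows that difference with the standard library and one way a script could make both spellings behave the same. normalize_output_path is a hypothetical helper name, not something trio_phaser provides.

```python
import os


def normalize_output_path(output_file):
    """Return an absolute path, resolving relative names against the current directory.

    Hypothetical helper for illustration; it is not part of trio_phaser.
    """
    return os.path.abspath(output_file)


# A bare file name has an empty directory component, while the ./ form does not.
print(repr(os.path.dirname("phased_output.vcf")))    # ''
print(repr(os.path.dirname("./phased_output.vcf")))  # '.'

# After normalization both spellings point at the same absolute location.
same = normalize_output_path("phased_output.vcf") == normalize_output_path("./phased_output.vcf")
print(same)  # True
```

If the script normalized --output_file this way before building any other paths from it, the ./ workaround would presumably no longer be needed, though that depends on how the script actually uses the value.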
I am trying to use the docker image on a cloud-based platform to run the pipeline but am running into problems because I am not running in an interactive environment where I can create directories. Even the example doesn't work here. I tried various ways to just put everything in the entry point, since the cloud platform automatically mounts (whatever that means, I actually have 0 experience with docker):
python3 /trio_phaser.py -c /trioPhaser/validate/son_ashkenazim_GRCh38_chr22.g.vcf.gz -p /trioPhaser/validate/father_ashkenazim_GRCh38_chr22.g.vcf.gz -m /trioPhaser/validate/mother_ashkenazim_GRCh38_chr22.g.vcf.gz -o phased_output.vcf.gz -r /data > /data/trio_phaser.out
and this:
python3 /trio_phaser.py -c /trioPhaser/validate/son_ashkenazim_GRCh38_chr22.g.vcf.gz -p /trioPhaser/validate/father_ashkenazim_GRCh38_chr22.g.vcf.gz -m /trioPhaser/validate/mother_ashkenazim_GRCh38_chr22.g.vcf.gz -o phased_output.vcf.gz -r /data/ > /data/trio_phaser.out
and this:
python3 /trio_phaser.py -c /trioPhaser/validate/son_ashkenazim_GRCh38_chr22.g.vcf.gz -p /trioPhaser/validate/father_ashkenazim_GRCh38_chr22.g.vcf.gz -m /trioPhaser/validate/mother_ashkenazim_GRCh38_chr22.g.vcf.gz -o /data/phased_output.vcf.gz -r /data/ > /trio_phaser.out
and i get errors like:
or, in the error log, i get this:
but, at the same time, the standard out returned this: