Closed h171e closed 1 year ago
@smlmbrt @ens-lgil @nebfield @fyvon
Would someone from the team, or a contact, be able to help me start getting some results by setting up and testing a Linux virtual machine through a service such as AWS? I created an AWS account but I don't know how to set it up and use it. I believe there is a higher probability of successfully executing the pgsc_calc pipeline with one of their Linux services.
They offer Amazon Linux, CentOS, Debian, Kali, Red Hat, SUSE, and Ubuntu. I am guessing the programs are more likely to run on one of these distros. I wonder which ones are most popular among people who use and develop the pgsc_calc pipeline.
But I think it could be much easier and faster if I could give access to a server opened through my AWS account to someone with experience who is willing and able to create a working environment capable of running the calculator. I could then use that environment to upload my VCF file and run the calculator on this server. I would need detailed instructions on how to set up a server and enable access, though.
If required, I am willing to pay a fee for this kind of service, especially since I have little chance of fixing this issue on my own. I have spent too much time and effort without success and would like to start using the calculator as soon as possible.
Another possibility is to try installing everything on my computer with a different Linux distro that could be more compatible and capable of doing the job. I already tried Fedora without success, though, and I would not know which distro, if any, has better compatibility.
Or maybe there is an alternative, more tedious way to run the calculator by downloading the needed PGS files, manually placing them in a folder with the correct settings, and running the steps in a different way.
Any suggestion would be greatly appreciated. Thank you.
Hi @hectorfp3000, we'll try to help you run the pipeline on your computer and fix anything that might be a bug on our end. It's best to try only the test command first:
nextflow run pgscatalog/pgsc_calc -profile test,docker
If that doesn't work, a custom command definitely won't work either.
First, it's worth checking that docker is running correctly. Could you try the following command:
docker run hello-world
Second, I think your samplesheet might be malformed. Because the vcf_path should be the root filename of the VCF, the second line should be:
NG13RY1WVD,home/hector/Downloads/NG13RY1WVD,,,
@smlmbrt I am able to run the command:
docker run hello-world
I was in the process of documenting how I was able to run the test
nextflow run pgscatalog/pgsc_calc -profile test,docker
and have all steps complete. Nextflow was installed as root in a newly created directory, /root/bin, and docker in its regular location. I ran the command
/root/bin/nextflow run pgscatalog/pgsc_calc -profile test,docker
as the root user. However, when I tried the same with my own file, it failed at step 1 with a different warning.
However, I had to install Linux again, since the wifi connection was failing and some tasks responded slowly or abnormally after I changed permissions on many files and folders in the operating system.
I was able to install singularity with an alternative set of instructions from a GitHub post at NIH-HPC / Singularity-tutorial, substituting the latest versions for the posted ones:
https://github.com/Singularity-tutorial/Singularity-tutorial.github.io/tree/master/01-installation
and installing go with these instructions from a website called rosehosting.com, replacing the posted version with the latest version, 1.20.7:
https://www.rosehosting.com/blog/how-to-install-go-golang-compiler-on-ubuntu-20-04/
All steps completed with singularity for the test. I believe the test ran not necessarily because I installed singularity with this alternative method, but because these instructions include a command that installs dependencies differently from the instructions on the official singularity website. After this I installed docker and was able to run the test with it too, and to have my own VCF file processed at steps 1 and 2, which was not possible before. So it's possible that installing these dependencies with this command also enabled me to run the pipeline using nextflow with docker. The command listed to install the dependencies for singularity was:
sudo apt-get install -y build-essential libssl-dev uuid-dev libgpgme11-dev \
squashfs-tools libseccomp-dev wget pkg-config git cryptsetup debootstrap
However, I was not able to run a calculation with my VCF file; it failed at step 3 with this error message:
Command error: FATAL: container creation failed: mount /proc/self/fd/3->/usr/local/var/singularity/mnt/session/rootfs error: while mounting image /proc/self/fd/3: failed to find loop device: could not attach image file to loop device: no loop devices available
I was documenting these results, along with other findings, for this post. However, I decided to install docker again and try to run the test as root with nextflow installed in the /root/bin folder. This time the test also completed.
I had changed the inputs in the samplesheet to reflect your suggestion; however, I am not sure that I made the correct changes. I set the first cell under sampleset to the name of the compressed VCF file, NG13RY1WVD.vcf.gz, and the first cell under column B, vcf_path, to the location and name of the file, /home/hector/Downloads/NG13RY1WVD.vcf.gz, leaving the other cells empty.
This time it seemed to complete steps 1 and 2 but failed at step 6 with a different error message:
ERROR ~ Error executing process > 'PGSCATALOG_PGSCALC:PGSCALC:MAKE_COMPATIBLE:PLINK2_VCF (NG13RY1WVD.vcf.gz chromosome ALL)'
Caused by:
Process PGSCATALOG_PGSCALC:PGSCALC:MAKE_COMPATIBLE:PLINK2_VCF (NG13RY1WVD.vcf.gz chromosome ALL)
terminated with an error exit status (6)
Command executed:
plink2 \
    --threads 2 \
    --memory 7168 \
    --set-all-var-ids '@:#:$r:$a' \
    --max-alleles 2 \
    --new-id-max-allele-len 100 missing \
    --vcf NG13RY1WVD.vcf.gz \
    --make-pgen vzs\
    --out vcf_NG13RY1WVD.vcf.gzALL # 'vcf' prefix is important
cat <<-END_VERSIONS > versions.yml
"PGSCATALOG_PGSCALC:PGSCALC:MAKE_COMPATIBLE:PLINK2_VCF":
    plink2: $(plink2 --version 2>&1 | sed 's/^PLINK v//; s/ 64.*$//' )
END_VERSIONS
I then tried the same test with nextflow in the regular /home/user/bin directory, changing the command to use singularity, and this time with max_memory set to reflect system capacity in the nextflow.config file in the pgsc_calc folder. I got the same error message I had with docker. This could suggest that the previous singularity error message
Command error: FATAL: container creation failed: mount /proc/self/fd/3->/usr/local/var/singularity/mnt/session/rootfs error: while mounting image /proc/self/fd/3: failed to find loop device: could not attach image file to loop device: no loop devices available
could be related to the max_memory attribute in the nextflow.config file not being set to reflect system capacity.
However, I am now not sure I can figure out the source of this error, since it seems to be related to an incompatibility between the VCF file, the samplesheet file, and the pipeline process, and not to the operating system or the configuration of nextflow or docker.
I appreciate all your efforts in assisting me with this problem and hope that with your help I will be able to use the program soon.
I apologize for the long post. Since I am not experienced with using terminal commands to install and execute programs, I am trying my best to document everything with care for detail, especially since I hope that anyone with an interest in this pipeline who hits the same or a similar error can find a solution in your assistance on this post. Thank you.
It's good that docker works with the test profile. You're correct that the PLINK2_VCF error you're reporting is because of a problem with your genomic data, not a problem with your operating system or computer configuration.
Only chromosomes 1-22, X, and Y are supported in VCF file input, but your VCF contains some variants that need to be removed or ignored (chr1_k....).
To ignore these extra variants you can edit the file conf/modules.config at line 41. Try changing:
withName: PLINK2_VCF {
ext.args = "--new-id-max-allele-len 100 missing"
}
to
withName: PLINK2_VCF {
ext.args = "--new-id-max-allele-len 100 missing --allow-extra-chr"
}
You should be able to find and edit this file by running:
$ cd $HOME/.nextflow/assets/pgscatalog/pgsc_calc
$ nano conf/modules.config
nano is a good text editor for making simple changes to a text file. Make the changes, save the file, and try rerunning the workflow.
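For the other option mentioned above (removing the extra variants instead of ignoring them), a minimal sketch could look like the following. It assumes an uncompressed VCF on standard input and contig names of the form 1-22, X, Y with an optional chr prefix. keep_canonical_chroms is a hypothetical helper name, not part of pgsc_calc, and this has not been tested against the pipeline.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of pgsc_calc): keep VCF header lines and only
# those records whose chromosome is 1-22, X, or Y, with or without "chr".
keep_canonical_chroms() {
    awk -F'\t' '
        /^#/ { print; next }            # header lines pass through unchanged
        {
            chrom = $1
            sub(/^chr/, "", chrom)      # treat "chr1" and "1" the same way
            if (chrom ~ /^([1-9]|1[0-9]|2[0-2]|X|Y)$/) print
        }
    '
}

# Example usage (file names are placeholders):
#   zcat input.vcf.gz | keep_canonical_chroms > filtered.vcf
```

Records such as chr1_KI270706v1_random or chrM are dropped, while header lines (including any ##contig entries) are left as they are.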
@nebfield
I made the changes to this file. This time the pipeline executed steps 1-3 and 6-7 but failed at step 8, MAKE_COMPATIBLE:MATCH_COMBINE (NG13RY1WVD.vcf.gz).
The error message has a section titled Command error: that seems to list processes related to the pipeline. Most of the lines in this list begin with pgscatalog_utils. The last three lines before a new list named Traceback (most recent call last): seem most important in describing the error for this step. They appear to point to an incompatibility involving an attribute or setting that defines something called a minimum matching threshold.
I don't know whether this relates to my VCF file and the quality or fidelity of the data produced by the genome sequencing procedure (in which case I might need a higher-quality whole genome sequencing test to get a better VCF file with higher coverage), to an error in the pipeline configuration, or to something else.
I wonder if other files from the test could be used or integrated into the pipeline to improve calculation accuracy or coverage. The company I used for testing provided CRAM, CRAI, FASTQ, and TBI files produced from my test, which are available for me to download.
I changed the target build attribute in the pipeline execution command from --target_build GRCh38 to --target_build GRCh37 to see if the results were different. This time the error message looked the same, but with (0.43% variants match) in the line that says
@nebfield
I changed the min_overlap option to 0.00 in the nextflow.config file for the pipeline. This time it seems step 8 completed but the run failed at step 9, although the Caused by: line ends by indicating that the process terminated with an error exit status (6), so maybe it relates to step 6. The Command error: portion of the error output says:
Command error:
Error: Invalid chromosome code 'chr1_KI270706v1_random' on line 4713526 of .pvar file.
(Use --allow-extra-chr to force it to be accepted.)
PLINK v2.00a3.3LM 64-bit Intel (3 Jun 2022) www.cog-genomics.org/plink/2.0/
(C) 2005-2022 Shaun Purcell, Christopher Chang GNU General Public License v3
Logging to NG13RY1WVD.vcf.gz_ALL_additive_0.log.
Options in effect:
  --memory 7168
  --out NG13RY1WVD.vcf.gz_ALL_additive_0
  --pfile vzs vcf_NG13RY1WVD.vcf.gz_ALL
  --score NG13RY1WVD.vcf.gz_ALL_additive_0.scorefile.gz zs header-read cols=+scoresums,+denom,-fid no-mean-imputation
  --seed 31
  --threads 2
Start time: Fri Aug 4 18:43:18 2023
7818 MiB RAM detected; reserving 7168 MiB for main workspace.
Using up to 2 compute threads.
1 sample (0 females, 0 males, 1 ambiguous; 1 founder) loaded from vcf_NG13RY1WVD.vcf.gz_ALL.psam.
End time: Fri Aug 4 18:43:20 2023
It seems the pipeline was able to complete all steps after I added the line
ext.args = "--new-id-max-allele-len 100 missing --allow-extra-chr"
to the process withName: PLINK2_SCORE, before the ext.args2 key on line 53 of the modules.config file in the pgsc_calc/conf folder.
I opened the HTML report, and it looks like the main attribute in the report is the match % for each score shown in the summary section.
I guess this is the % of variants shared between my own file and the variants in the study. Is it adjusted for, or does it take into account, the statistical weight each variant contributes to trait risk, or adjustments for populations?
I don't know much about how polygenic scores work or how they are calculated, and I apologize for this. From what I have seen, there are important variables, like the effect size and significance of each variant, that define its contribution to the score, and calculating a percentile requires comparing the results against the imputed data of a reference set. Is this something that can be applied to these results or added to the calculation?
You can read about the data in the report within the documentation: https://pgsc-calc.readthedocs.io/en/latest/explanation/output.html
Those numbers give the number of variants in the scores that match your genotypes. These are quite low, likely because you're using non-imputed genotyping data (e.g. just variant calls from whole genome sequencing) or something else. If you actually have a gVCF file it may be possible to genotype those positions using other software to get better overall mapping: https://github.com/PGScatalog/pgsc_calc/discussions/123#discussioncomment-6469422, or it's also possible that your genotypes are in the wrong build.
The actual PGS values are shown for reference at the bottom of the report, but also output in a separate file.
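To make the match % concrete, here is a minimal sketch of the idea: the share of variants in a scoring file that are also present in the target genotypes. It assumes two plain-text files with one variant ID per line, which is a simplification; match_rate is a hypothetical helper name, and the pipeline's own matching is more involved than a plain ID comparison.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of pgsc_calc): print the percentage of
# scoring-file variant IDs that also appear in the target ID list.
# Usage: match_rate TARGET_IDS SCORE_IDS   (one variant ID per line in each)
match_rate() {
    awk '
        FILENAME == ARGV[1] { target[$1] = 1; next }   # first file: remember target IDs
        { total++; if ($1 in target) hit++ }            # second file: count score variants and matches
        END {
            if (total > 0) printf "%.2f\n", 100 * hit / total
            else print "0.00"
        }
    ' "$1" "$2"
}
```

A low value from a check like this is what trips the minimum matching threshold (the min_overlap option mentioned earlier in the thread).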
Description of the bug
Hi. I hope to get some help or suggestions. I have no experience with Linux and installed Ubuntu 22.04.2, for the first time, on a new SSD only to try to run pgsc_calc for personal use.
I installed docker engine and docker desktop, and created a key and an account to log in. It seems to be installed correctly, although I have not run any specific test. Next I installed nextflow. I used the online guides they provide and also some YouTube tutorials for reference.
I would prefer to learn and figure out how to run this program on my PC. Since I have limited experience and knowledge of how to install, run, and modify related programs and debug errors, more help, time, and effort will be required, and even then there is no guarantee that I will be able to run the program on my PC. A possible temporary solution could be access to a virtual machine with all the required dependencies installed, perhaps created and tested by an expert or a user of the program who understands how the whole system works.
I wonder if someone could help me find a person who works with this program, or is such a user themselves, and might be able to provide help in this way. Maybe someone would be kind enough to make this effort out of a personal interest in enabling someone like me to use the tool to explore personal polygenic scores.
I have a pressing urge to get polygenic scores from the catalog, and it has been very difficult to get the Linux system to run this calculator. I have spent many hours trying to figure this out to no avail, and I wish I could at least start getting some results from a system installed on a virtual host.
With that said, I will describe and document the process to see if someone might be able to help me fix it.
After installing nextflow and docker, I tried this in the terminal:
nextflow run pgscatalog/pgsc_calc --help
It displayed correctly. I tried the next test and it did not work:
nextflow run pgscatalog/pgsc_calc -profile test,docker
It displays messages in red:
ERROR ~ Error executing process > 'PGSCATALOG_PGSCALC:PGSCALC:DOWNLOAD_SCOREFILES ([pgs_id:PGS002786, pgp_id:, trait_efo:])'
Further down it says:
Command error: touch: cannot touch '.command.trace': Permission denied
I then created a csv sheet with the details for my own VCF file to test.
And the result I believe is the same as with the previous test:
I tried these commands to add read and write permissions for all folders, but it did not change anything:
chown user
chmod o+w
chmod ugo+rwx
It seems that every time a test run is made, a new folder with a random name is created in the work folder, and the needed permissions are not configured.
I checked the permissions for the work folder. Access for owner, group, and others is set to create and delete files.
But for the enclosed files, the others category is set to read-only for files and access for folders. I tried to change this to read and write and to create and delete files, but when I close the window by clicking change, the changes are not saved.
I tried to run a test with my own VCF file again for a polygenic score calculation, “--pgs_id PGS000908”:
nextflow run pgscatalog/pgsc_calc \
    -profile docker \
    --input samplesheet.csv --target_build GRCh37 \
    --pgs_id PGS000908
It returned the same error message.
It is suggested in the terminal error message to replicate the error with the following instruction:
Tip: you can replicate the issue by changing to the process work dir and entering the following command ‘bash .command.run’
I opened the files application to identify and open the suggested folder:
~/work/30/1656242c1551aebe610dacb7af1b55
This folder contained 8 files:
1 text file link to a CSV document: samplesheet.csv
4 plain text files: .command.begin, .command.err, .command.out, .exitcode
1 application log (text/x-log): .command.log
2 shell script files: .command.run, .command.sh
I right-clicked the folder, selected the “open in terminal” option, and ran the suggested command “bash .command.run”, which returned the same error message: “touch: cannot touch ‘.command.trace’: Permission denied”.
I went back to the “~/work/30” folder with the “cd ../” command, checked the read, write, and execute permissions using the “ls -l” command, and found that the write permission was missing for the others group. I added the permission to the “~/work/30/1656242c1551aebe610dacb7af1b55” folder using “chmod o+w”, checked the permissions again, and found that it now has the write permission for the others group. I then executed the “bash .command.run” command again. This time it seemed to run correctly, returning a series of messages whose meaning or relevance I don’t know.
I then found that the “~/work/30/1656242c1551aebe610dacb7af1b55” folder now contains 3 new files that were written with limited permissions.
1 plain text document: .command.trace
1 JSON document: out.json
1 YAML document: versions.yml
When I run the original command again to see if the calculator proceeds to the next step
nextflow run pgscatalog/pgsc_calc \
    -profile docker \
    --input samplesheet.csv --target_build GRCh37 \
    --pgs_id PGS000908
it creates new files inside a new folder in the work folder which again is missing the permission and the output is the same error.
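The manual check described above (looking at the ls -l output for a missing write bit for others) can be sketched as a small function. others_can_write is a hypothetical helper name, and stat -c is the GNU coreutils form available on Ubuntu; this only reports the permission bit and does not explain why new work folders are created without it.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed (exit status 0) if "others" have write
# permission on the given path, fail (exit status 1) otherwise.
others_can_write() {
    perms=$(stat -c '%A' "$1") || return 2   # e.g. drwxr-xr-x (GNU stat)
    case "$perms" in
        ????????w?) return 0 ;;   # 9th character is the others-write bit
        *)          return 1 ;;
    esac
}

# Example usage (the path is a placeholder):
#   others_can_write "$HOME/work" && echo "others can write" || echo "others cannot write"
```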
Tried to install singularity but was not able to.
Command used and terminal output
No response
Relevant files
No response
System information
Hardware Type: Custom Desktop PC
Software: Ubuntu 22.04.2
Processor: Intel(R) Core(TM) i5-4670K CPU @ 3.40GHz
Graphics Card: NVIDIA GeForce GTX 1650
Motherboard: Gigabyte Ultra Durable Z87X-D3H F9 Bios
RAM: 8 GB DDR3
Disk: Vulcan Z 2.5 SSD SATA III 6Gb/s