Hi,
Indeed, for the first part, it seems that singularity is not auto-mounting your custom temporary directory. You might need to configure singularity or nextflow for your own particular system. Unfortunately, different computing environments are configured in various ways that we cannot account for.
As for the second part, yes, we do overwrite TMPDIR, because by default it pointed to a location with very small disk space on our systems. That change, however, is not visible outside that one Nextflow process. I agree, we should try harder to detect an appropriate temporary directory.
Hi,
Thanks for your response. While you, of course, cannot account for all eventualities, our setup is not unusual for clusters whose compute nodes do not have individual disks.
It would be more generally applicable to mount $TMPDIR instead of assuming that a hard-coded /scratch will work. In any case, the software seems to be looking in the right directory; it is simply not accessible from within Singularity (if that is indeed the reason). As I stated, $TMPDIR itself is accessible, so this seems plausible as the underlying cause. I think this could in fact be a bug that happens to work in your particular setup. The fix would then be to have Singularity mount $TMPDIR, which should work in more setups without modification and make this pipeline accessible to others.
I think your case can be configured quite easily by changing the nextflow.config file you find in GraffiTE. Change the line:
singularity.runOptions = '--contain'
to
singularity.runOptions = '--contain -B /scratch:/tmp'
or
singularity.runOptions = '--contain -B /scratch:/scratch'
if $TMPDIR points to /scratch in the container environment. This should bind /tmp inside the container to /scratch on your host system.
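A more general variant of the same idea (a sketch; it assumes $TMPDIR is set in the environment where Nextflow is launched, since the Groovy config is evaluated on the host) binds whatever $TMPDIR points to rather than a fixed /scratch:
// nextflow.config (sketch): bind the host's per-job $TMPDIR at the same path inside the container
// (guard against TMPDIR being unset, otherwise the bind string becomes 'null')
singularity.runOptions = "--contain -B ${System.getenv('TMPDIR')}"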
I did look at other nf-core pipelines and they all seem to delegate system-specific configuration to Nextflow, which makes sense since developers cannot make any assumptions about the host system (even if certain configurations are common).
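Such site-specific settings could also live in a small config passed to Nextflow at run time, for example (a sketch; the profile name, executor, and bind path are illustrative):
// my_cluster.config (sketch): keep host-specific settings outside the pipeline itself
profiles {
    my_cluster {
        process.executor       = 'slurm'
        singularity.enabled    = true
        singularity.runOptions = "--contain -B ${System.getenv('TMPDIR')}"
    }
}
// used as: nextflow run ... -c my_cluster.config -profile my_cluster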
Thanks for the tip. I ended up removing the "--contain" flag altogether, since it always seemed to insist on going to the $TMPDIR path whatever I tried to bind. That worked; however, the pipeline now stops after the first TSD step. The job finished as successful and no error was generated, but I think the pipeline is not complete, as there is no 3_... folder in the output. The last result folder generated is 2_Repeat_Filtering, with:
genotypes_repmasked_filtered.vcf repeatmasker_dir
The job output:
executor > local (1)
[ba/f7b952] process > svim_asm (5) [100%] 16 of 16, cached: 16 ✔
[06/c2782e] process > repeatmasker (1) [100%] 1 of 1, cached: 1 ✔
[bd/7be51f] process > tsd_prep (1) [100%] 1 of 1 ✔
[- ] process > tsd_search -
[- ] process > tsd_report -
The .command.sh:
#!/bin/bash -ue
ls *.vcf > vcfs.txt
SURVIVOR merge vcfs.txt 0.1 0 0 0 0 100 genotypes.vcf
repmask_vcf.sh genotypes.vcf genotypes_repmasked.vcf.gz combi_repmod_repbase_26_01_dfam_3_5_insecta.lib
bcftools view -G genotypes_repmasked.vcf.gz | awk -v FS=' ' -v OFS=' ' '{if($0 ~ /#CHROM/) {$9 = "FORMAT"; $10 = "ref"; print $0} else if(substr($0, 1, 1) == "#") {print $0} else {$9 = "GT"; $10 = "1|0"; print $0}}' | awk 'NR==1{print; print "##FORMAT=<ID=GT,Number=1,Type=String,Description=\"Genotype\">"} NR!=1' | bcftools view -i 'INFO/total_match_span > 0.80' -o genotypes_repmasked_temp.vcf
fix_vcf.py --ref hifiasm_scaff10x_arks.fa.masked --vcf_in genotypes_repmasked_temp.vcf --vcf_out genotypes_repmasked_filtered.vcf
Hello!
There might be an issue with the output of the second process, leading to a mistake in the files made by tsd_prep. That process must be emitting an empty channel to the next process, tsd_search, which is waiting for something.
Check that the VCF in 2_Repeat_Filtering has variants. If not (only a header), it could be that the process did not complete as intended but did not run into an error and is now cached. So I would try restarting the whole thing from scratch (without -resume) to see if it makes a difference.
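A quick way to check (a sketch; run it from your results directory):
# count non-header records in the filtered VCF; 0 means the file is header-only
grep -vc '^#' 2_Repeat_Filtering/genotypes_repmasked_filtered.vcf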
To spare the alignment time, you may want to use a new option we have that can take a VCF as input. You will need to merge the individual VCFs present in 1_SV_search with SURVIVOR, as follows:
cd 1_SV_search/
ls *.vcf > vcfs.txt
SURVIVOR merge vcfs.txt 0.1 0 0 0 0 100 merged.vcf
Then you run GraffiTE by removing --assemblies and adding --vcf merged.vcf. It will use this VCF as input for RepeatMasker and will go on with the rest of the processes.
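In practice that could look something like the sketch below; everything other than the swapped input option is just your original command, kept here as a placeholder:
# rerun GraffiTE on the pre-merged VCF instead of the assemblies;
# keep all the other options from your original command and drop --assemblies
nextflow run cgroza/GraffiTE --vcf 1_SV_search/merged.vcf ...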
Let me know!
Clém
Hello,
Thank you. I tried; however, regardless, the pipeline does not progress past:
[c2/6fcac3] process > tsd_prep (1) [100%] 1 of 1 ✔
I did not use -resume.
.command.sh:
#!/bin/bash -ue
cp repeatmasker_dir/repeatmasker_dir/* .
prepTSD.sh hifiasm_scaff10x_arks.fa.masked 30
.command.log:
extracting flanking...
index file hifiasm_scaff10x_arks.fa.masked.fai not found, generating...
extracting SVs' 5' and 3' ends...
Folder content:
total 1111825
-rw-r--r-- 1 fr_de1013 fr_fr 120932888 Dec 1 19:26 ALL.onecode.elem_sorted.bak
-rw-r--r-- 1 fr_de1013 fr_fr 86525 Dec 1 19:26 OneCode_LTR.dic
-rw-r--r-- 1 fr_de1013 fr_fr 10 Dec 1 19:26 SV_sequences_L_R_trimmed_WIN.fa
-rw-r--r-- 1 fr_de1013 fr_fr 0 Dec 1 19:26 flanking_sequences.fasta
-rw-r--r-- 1 fr_de1013 fr_fr 5176 Dec 1 19:26 gLength.txt
lrwxrwxrwx 1 fr_de1013 fr_fr 132 Dec 1 19:26 genotypes_repmasked_filtered.vcf -> [my path]/work/bc/df1882c112dc9c24e3bebd9d464a51/genotypes_repmasked_filtered.vcf
lrwxrwxrwx 1 fr_de1013 fr_fr 99 Dec 1 19:26 hifiasm_scaff10x_arks.fa.masked -> [my path]/hifiasm_scaff10x_arks.fa.masked
-rw-r--r-- 1 fr_de1013 fr_fr 15040 Dec 1 19:26 hifiasm_scaff10x_arks.fa.masked.fai
-rw-r--r-- 1 fr_de1013 fr_fr 225140303 Dec 1 19:26 indels.fa.cat.gz
-rw-r--r-- 1 fr_de1013 fr_fr 240145914 Dec 1 19:26 indels.fa.masked
-rw-r--r-- 1 fr_de1013 fr_fr 73953562 Dec 1 19:26 indels.fa.onecode.out
-rw-r--r-- 1 fr_de1013 fr_fr 119396982 Dec 1 19:26 indels.fa.out
-rw-r--r-- 1 fr_de1013 fr_fr 130131 Dec 1 19:26 indels.fa.out.length
-rw-r--r-- 1 fr_de1013 fr_fr 179322798 Dec 1 19:26 indels.fa.out.log.txt
-rw-r--r-- 1 fr_de1013 fr_fr 2480 Dec 1 19:26 indels.fa.tbl
-rw-r--r-- 1 fr_de1013 fr_fr 0 Dec 1 19:26 indels.txt
-rw-r--r-- 1 fr_de1013 fr_fr 0 Dec 1 19:26 oneHit_SV_coordinates.bed
-rw-r--r-- 1 fr_de1013 fr_fr 0 Dec 1 19:26 oneHit_SV_coordinates_win.bed
-rw-r--r-- 1 fr_de1013 fr_fr 0 Dec 1 19:26 oneHit_indels.fa.masked
-rw-r--r-- 1 fr_de1013 fr_fr 179323206 Dec 1 19:26 onecode.log
drwxr-xr-x 2 fr_de1013 fr_fr 8192 Dec 1 23:22 repeatmasker_dir
Cheers Daniel
Hello Daniel,
Sorry for the delays, I was travelling. Can you email me your VCF?
Cheers,
Clém
Hey Dan,
I checked your files. Indeed, the first VCFs generated had no variants; something must have happened.
The manually merged VCF seems good to run with --vcf; however, you sent me a .gz-compressed version, and it needs to be uncompressed to be used by GraffiTE for now. Can you confirm/refute this hypothesis?
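If it is just gzip-compressed, decompressing it is a one-liner, for example:
gunzip merged.vcf.gz    # writes merged.vcf in place of the .gz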
In any case, I recommend updating the GitHub repo (the image is unchanged, however), because I recently fixed some small bugs that show up when the pipeline runs without the --mammal flag.
Cheers,
Clém
Hi Clém,
I just compressed the file manually to send to you. Trying the update. Thanks :)
Cheers D
Hi Clém, I am in the same boat: I get exactly the same output, and GraffiTE ends at the tsd_prep step. I believe my problem is not with the tmp directory. My genotypes_repmasked_filtered.vcf file is also very similar; there is no record of any variants.
Best regards, Gautam
Hi all, happy new year!
An update from me: as Clém suggested, the issue was in the scaffold names for me. Specifically, there were commas. One of the scaffolding programs had created commas in the FASTA sequence headers, which some program in the pipeline could not deal with. Replacing the commas with underscores fixed the issue, and the pipeline completed successfully. Thanks for all your help.
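For anyone hitting the same problem, a one-liner along these lines should do that replacement (a sketch; file names are placeholders):
# replace commas with underscores, only in FASTA header lines
sed '/^>/ s/,/_/g' assembly_with_commas.fa > assembly_fixed.fa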
Cheers Daniel
Hi all, Happy New Year to you too! @dewuem: I am glad this fix worked for you, and thanks for pointing out the issue with contig names. I also tried fixing my contig names. Underscores did not work for me, but plainly numbering them like chr1, chr3, ... did. I am now at the genotyping stage (with Vg). Thanks to both of you for all the discussion and the various suggestions.
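For reference, one way to do that renumbering (a sketch; file names are placeholders, and it assumes the order of sequences in the FASTA is a fine basis for the new names):
# rename every FASTA header to chr1, chr2, ... in order of appearance
awk '/^>/{print ">chr" ++n; next} {print}' assembly_old.fa > assembly_renamed.fa
# keep a map of old -> new names for later reference
grep '^>' assembly_old.fa | awk '{print $0 "\tchr" NR}' > name_map.tsv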
@cgroza and @clemgoub : Thanks for developing such a great pipeline (with a singularity container)!!
Cheers, Gautam
Hello all and Happy New year!
I am so glad you could fix your issues! Your feedback is important to improve the pipeline documentation! Please let us know how things go and if you need further help!
Cheers,
Clément
Copying here my message to @dewuem about the scaffold/contig/chromosome name issue:
Hooo, I think bcftools doesn't like the contig header. I found this at the end of the RepeatMasker log, which corresponds to the step where we annotate the VCF with the repeats:
[W::vcf_parse] Contig 'scaffold1,52763537,f1Z52763537' is not defined in the header. (Quick workaround: index the file with tabix.)
[E::bcf_hdr_parse_line] Could not parse the header line: "##contig=<ID=scaffold1,52763537,f1Z52763537>"
[E::vcf_parse] Could not add dummy header for contig 'scaffold1,52763537,f1Z52763537'
I think it fixes the first line by itself, but it might not like the header. To test, I would use only one contig (maybe the longest), rename it without commas, and then restart from the top of the pipeline (--assemblies and not --vcf) to be sure.
See also issue #12
Hi,
I seem to have an issue at the RepeatMasker stage of the pipeline that could be related to how $TMPDIR is used. I am trying to run GraffiTE on a Slurm-based cluster where $TMPDIR is assigned per job. It contains a path to a folder in /scratch; the pattern is /scratch/user-ID_job_354286_o07c02. Only in this path can the job actually write to scratch. The error says this path is not writeable:
etc. From there on it keeps failing. Running the script interactively shows that $TMPDIR is in fact writeable, since I could write and read files there even after the pipeline failed.
From searching for this and similar errors, I found that the 'sort' command makes use of $TMPDIR but cannot access it. It might be necessary to bind the location of $TMPDIR in Singularity (though I am not sure whether this step runs in Singularity). https://github.com/nf-core/chipseq/issues/123#issuecomment-557805550
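If it helps, the hypothesis can be tested outside Nextflow with something like the following (a sketch; the image name is illustrative, and it assumes standard coreutils are present inside the container):
# with --contain alone, the per-job /scratch/... directory should not be visible inside the container
singularity exec --contain graffite.sif ls -ld "$TMPDIR"
# with an explicit bind of the host $TMPDIR it should be visible and writable
singularity exec --contain -B "$TMPDIR" graffite.sif touch "$TMPDIR/write_test"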
At the same time, in the pipeline, the system-set $TMPDIR appears to be overwritten (excerpt from main.nf lines 323 to 334, newest code from the repo):
I am not sure if the pipeline has reached this stage yet, but that might be another potential issue in any case, since any program that accesses $TMPDIR will not find it after that change.
Here is the .command.sh
I hope I was clear in my description, if you need additional information please let me know. Thanks for your help.
Edit: Legibility of code