cgroza / GraffiTE

GraffiTE is a pipeline that finds polymorphic transposable elements in genome assemblies and/or long reads, and genotypes the discovered polymorphisms in read sets using genome-graphs.

Possible issue related to $TMPDIR #8

Closed dewuem closed 1 year ago

dewuem commented 2 years ago

Hi,

I seem to have an issue at the RepeatMasker stage of the pipeline that could be related to how $TMPDIR is used. I am trying to run GraffiTE on a Slurm-based cluster where $TMPDIR is assigned per job. It contains a path to a folder in /scratch, following the pattern /scratch/user-ID_job_354286_o07c02; only under this path can the job write to scratch. The error says this path is not writeable:

compute repeat proportion for each SVs...
sort: cannot create temporary file in '/scratch/fr_de1013_job_354286_o07c02': No such file or directory
sort: cannot read: span: No such file or directory
sort: cannot create temporary file in '/scratch/fr_de1013_job_354286_o07c02': No such file or directory
Mammalian filters ON. Filtering...
awk: cmd. line:1: warning: regexp escape sequence '\#' is not a known regexp operator
sort: cannot create temporary file in '/scratch/fr_de1013_job_354286_o07c02': No such file or directory
sort: cannot read: TwP.txt: No such file or directory
writing vcf...
mktemp: failed to create file via template '/scratch/fr_de1013_job_354286_o07c02/tmp.XXXXXXXXXX': No such file or directory
/home/fr/fr_fr/fr_de1013/bin/GraffiTE/bin/repmask_vcf.sh: line 121: ${HDR_FILE}: ambiguous redirect

etc. From there on it keeps failing. Running the script interactively shows that $TMPDIR is in fact writeable, since I could write and read files there even after the pipeline failed.

From searching for this and similar errors, I found that the 'sort' command makes use of TMPDIR but cannot access it. It might be necessary to bind the location of $TMPDIR in singularity (though I am not sure whether this step runs inside singularity). https://github.com/nf-core/chipseq/issues/123#issuecomment-557805550

At the same time, the pipeline appears to overwrite the system-set $TMPDIR (excerpt from main.nf lines 323 to 334, newest code from the repo):

switch(params.graph_method) {
                case "giraffe":
                    prep + """
                    vg autoindex --tmp-dir \$PWD  -p index/index -w giraffe -v sorted.vcf.gz -r ${fasta}
                    """ + finish
                    break
                case "graphaligner":
                    prep + """
                    export TMPDIR=$PWD
                    vg construct -a  -r ${fasta} -v ${vcf} -m 1024 > index/index.vg
                    """ + finish
                    break

I am not sure whether the pipeline reaches this stage in my run, but that could be another issue in any case, since any program that reads $TMPDIR after that change will no longer find the original directory.
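A defensive alternative (just a sketch, not the pipeline's current code) would be to keep the system-provided $TMPDIR whenever it is usable and only fall back to the work directory otherwise:

```shell
# Keep the system-provided TMPDIR if it exists and is writable;
# otherwise fall back to the current (Nextflow work) directory.
if [ -z "${TMPDIR:-}" ] || [ ! -d "$TMPDIR" ] || [ ! -w "$TMPDIR" ]; then
    export TMPDIR="$PWD"
fi
```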

Here is the .command.sh

#!/bin/bash -ue
ls *.vcf > vcfs.txt
SURVIVOR merge vcfs.txt 0.1 0 0 0 0 100 genotypes.vcf
repmask_vcf.sh genotypes.vcf genotypes_repmasked.vcf.gz combi_repmod_repbase_26_01_dfam_3_5_insecta.lib
bcftools view -G genotypes_repmasked.vcf.gz |   awk -v FS='  ' -v OFS='   '   '{if($0 ~ /#CHROM/) {$9 = "FORMAT"; $10 = "ref"; print $0} else if(substr($0, 1, 1) == "#") {print $0} else {$9 = "GT"; $10 = "1|0"; print $0}}' |   awk 'NR==1{print; print "##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">"} NR!=1' |   bcftools view -i 'INFO/total_match_span > 0.80' -o genotypes_repmasked_temp.vcf
fix_vcf.py --ref hifiasm_scaff10x_arks.fa.masked --vcf_in genotypes_repmasked_temp.vcf --vcf_out genotypes_repmasked_filtered.vcf

I hope I was clear in my description, if you need additional information please let me know. Thanks for your help.

Edit: Legibility of code

cgroza commented 2 years ago

Hi,

Indeed, for the first part, it seems that singularity is not auto-mounting your custom temporary directory. You may need to configure singularity or nextflow for your particular system. Unfortunately, different computing environments are configured in ways we cannot fully account for.

As for the second part, yes, we do overwrite TMPDIR because, by default, it pointed to a location with very little disk space on our systems. That said, the change is only visible inside that one nextflow process. I agree that we should try harder to detect an appropriate temporary directory.

dewuem commented 2 years ago

Hi,

Thanks for your response. While you, of course, cannot account for all eventualities, our setup is not unusual for clusters whose compute nodes do not have individual disks.

It would be more generally applicable to mount $TMPDIR instead of assuming that a hard-coded /scratch will work. In any case, the software seems to be looking in the right directory; the directory just is not accessible from within singularity (if that is indeed the reason). As I said, $TMPDIR itself is accessible from the host, so this seems plausible as the underlying cause. In fact, this could be a bug that happens to work in your particular setup. The fix would then be to have singularity mount $TMPDIR, which should work in more setups without modification and make the pipeline accessible to others.
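As a sketch of that suggestion (hypothetical, not a tested configuration), the nextflow.config option could forward the host's job-specific directory instead of a hard-coded path, e.g. by reading the environment when the config is parsed:

```groovy
// Bind the host's per-job temp dir into the container at the same path,
// instead of assuming a fixed /scratch location.
singularity.runOptions = "--contain -B ${System.getenv('TMPDIR')}"
```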

cgroza commented 2 years ago

I think your case can be configured quite easily by changing the nextflow.config file you find in GraffiTE. Change the line:

singularity.runOptions = '--contain'

to

singularity.runOptions = '--contain -B /scratch:/tmp'

or

singularity.runOptions = '--contain -B /scratch:/scratch'

if $TMPDIR points to /scratch in the container environment.

This binds /scratch on your host to /tmp (first option) or /scratch (second option) inside the container. I did look at other nf-core pipelines, and they all seem to delegate system-specific configuration to nextflow, which makes sense since developers cannot make assumptions about the host system (even if certain configurations are common).

dewuem commented 1 year ago

Thanks for the tip. I ended up removing the "--contain" flag altogether, since the pipeline always seemed to insist on the $TMPDIR path no matter what I tried to bind. That worked; however, the pipeline now stops after the first TSD step. The job finished as successful and no error was generated, but the pipeline does not appear complete: there is no 3_... folder in the output. The last result folder generated is 2_Repeat_Filtering, containing genotypes_repmasked_filtered.vcf and repeatmasker_dir.

The job output:

executor >  local (1)
[ba/f7b952] process > svim_asm (5)     [100%] 16 of 16, cached: 16 ✔
[06/c2782e] process > repeatmasker (1) [100%] 1 of 1, cached: 1 ✔
[bd/7be51f] process > tsd_prep (1)     [100%] 1 of 1 ✔
[-        ] process > tsd_search       -
[-        ] process > tsd_report       -

The .command.sh:

#!/bin/bash -ue
ls *.vcf > vcfs.txt
SURVIVOR merge vcfs.txt 0.1 0 0 0 0 100 genotypes.vcf
repmask_vcf.sh genotypes.vcf genotypes_repmasked.vcf.gz combi_repmod_repbase_26_01_dfam_3_5_insecta.lib
bcftools view -G genotypes_repmasked.vcf.gz |     awk -v FS='   ' -v OFS='  '     '{if($0 ~ /#CHROM/) {$9 = "FORMAT"; $10 = "ref"; print $0} else if(substr($0, 1, 1) == "#") {print $0} else {$9 = "GT"; $10 = "1|0"; print $0}}' |     awk 'NR==1{print; print "##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">"} NR!=1' |     bcftools view -i 'INFO/total_match_span > 0.80' -o genotypes_repmasked_temp.vcf
fix_vcf.py --ref hifiasm_scaff10x_arks.fa.masked --vcf_in genotypes_repmasked_temp.vcf --vcf_out genotypes_repmasked_filtered.vcf
clemgoub commented 1 year ago

Hello!

There might be an issue with the output of the second process, leading to a mistake in the files made by tsd_prep. That process must be emitting an empty channel to the next process, tsd_search, which is waiting for input.

Check that the VCF in 2_Repeat_Filtering has variants. If not (only a header), it could be that the process did not complete as intended but did not raise an error, and is now cached. In that case I would try restarting the whole thing from scratch (without -resume) to see if it makes a difference.
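A quick way to check this from the shell (assuming the default output folder and file names shown above):

```shell
# Count non-header records in the filtered VCF; 0 means only a header survived.
grep -vc '^#' 2_Repeat_Filtering/genotypes_repmasked_filtered.vcf
```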


To spare the alignment time, you may want to use a new option we have that can take a VCF as input. You will need to merge the individual VCFs present in 1_SV_search with SURVIVOR as follows:

cd 1_SV_search/
ls *.vcf > vcfs.txt
SURVIVOR merge vcfs.txt 0.1 0 0 0 0 100 merged.vcf

Then you run GraffiTE by removing --assemblies and adding --vcf merged.vcf. It will use this VCF as input for RepeatMasker and will go on with the rest of the processes.

Let me know!

Clém

dewuem commented 1 year ago

Hello,

Thank you. I tried; however, the pipeline does not progress past [c2/6fcac3] process > tsd_prep (1) [100%] 1 of 1 ✔ regardless. I did not use -resume.

.command.sh:

#!/bin/bash -ue
cp repeatmasker_dir/repeatmasker_dir/* .
prepTSD.sh hifiasm_scaff10x_arks.fa.masked 30

.command.log:

extracting flanking...
index file hifiasm_scaff10x_arks.fa.masked.fai not found, generating...
extracting SVs' 5' and 3' ends...

Folder content:

total 1111825
-rw-r--r-- 1 fr_de1013 fr_fr 120932888 Dec  1 19:26 ALL.onecode.elem_sorted.bak
-rw-r--r-- 1 fr_de1013 fr_fr     86525 Dec  1 19:26 OneCode_LTR.dic
-rw-r--r-- 1 fr_de1013 fr_fr        10 Dec  1 19:26 SV_sequences_L_R_trimmed_WIN.fa
-rw-r--r-- 1 fr_de1013 fr_fr         0 Dec  1 19:26 flanking_sequences.fasta
-rw-r--r-- 1 fr_de1013 fr_fr      5176 Dec  1 19:26 gLength.txt
lrwxrwxrwx 1 fr_de1013 fr_fr       132 Dec  1 19:26 genotypes_repmasked_filtered.vcf -> [my path]/work/bc/df1882c112dc9c24e3bebd9d464a51/genotypes_repmasked_filtered.vcf
lrwxrwxrwx 1 fr_de1013 fr_fr        99 Dec  1 19:26 hifiasm_scaff10x_arks.fa.masked -> [my path]/hifiasm_scaff10x_arks.fa.masked
-rw-r--r-- 1 fr_de1013 fr_fr     15040 Dec  1 19:26 hifiasm_scaff10x_arks.fa.masked.fai
-rw-r--r-- 1 fr_de1013 fr_fr 225140303 Dec  1 19:26 indels.fa.cat.gz
-rw-r--r-- 1 fr_de1013 fr_fr 240145914 Dec  1 19:26 indels.fa.masked
-rw-r--r-- 1 fr_de1013 fr_fr  73953562 Dec  1 19:26 indels.fa.onecode.out
-rw-r--r-- 1 fr_de1013 fr_fr 119396982 Dec  1 19:26 indels.fa.out
-rw-r--r-- 1 fr_de1013 fr_fr    130131 Dec  1 19:26 indels.fa.out.length
-rw-r--r-- 1 fr_de1013 fr_fr 179322798 Dec  1 19:26 indels.fa.out.log.txt
-rw-r--r-- 1 fr_de1013 fr_fr      2480 Dec  1 19:26 indels.fa.tbl
-rw-r--r-- 1 fr_de1013 fr_fr         0 Dec  1 19:26 indels.txt
-rw-r--r-- 1 fr_de1013 fr_fr         0 Dec  1 19:26 oneHit_SV_coordinates.bed
-rw-r--r-- 1 fr_de1013 fr_fr         0 Dec  1 19:26 oneHit_SV_coordinates_win.bed
-rw-r--r-- 1 fr_de1013 fr_fr         0 Dec  1 19:26 oneHit_indels.fa.masked
-rw-r--r-- 1 fr_de1013 fr_fr 179323206 Dec  1 19:26 onecode.log
drwxr-xr-x 2 fr_de1013 fr_fr      8192 Dec  1 23:22 repeatmasker_dir

Cheers Daniel

clemgoub commented 1 year ago

Hello Daniel,

Sorry for the delays, I was travelling. Can you email me your VCF?

Cheers,

Clém

clemgoub commented 1 year ago

Hey Dan,

I checked your files. Indeed, the first VCFs generated had no variants; something must have gone wrong. The manually merged VCF looks good to run with --vcf; however, you sent me a .gz compressed version, and for now it needs to be uncompressed to be used by GraffiTE. Can you confirm/refute this hypothesis?

In any case, I recommend updating the GitHub repo (the image is unchanged, however) because I recently fixed some small bugs that appear when the pipeline runs without the --mammal flag.

Cheers,

Clém

dewuem commented 1 year ago

Hi Clém,

I just compressed the file manually to send to you. Trying the update. Thanks :)

Cheers D


coevolutionlab commented 1 year ago

Hi Clém, I am in the same boat: I get exactly the same output as Daniel above, and GraffiTE ends at the tsd_prep step. I believe my problem is not with the tmp directory. My genotypes_repmasked_filtered.vcf file is also very similar; there is no record of any variants.

Best regards, Gautam

dewuem commented 1 year ago

Hi all, happy new year!

An update from me: as Clém suggested, the issue was with the scaffold names. Specifically, they contained commas; one of the scaffolding programs had inserted commas into the FASTA sequence headers, which some program in the pipeline could not handle. Replacing the commas with underscores fixed the issue and the pipeline completed successfully. Thanks for all your help.
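For anyone hitting the same problem, a one-liner like this (a sketch; extend the substitution if your headers contain other special characters, and note the file names are placeholders) replaces commas in FASTA headers only, leaving sequence lines untouched:

```shell
# Replace commas with underscores on header lines (starting with '>') only.
sed '/^>/ s/,/_/g' assembly.fa > assembly.renamed.fa
```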

Cheers Daniel

coevolutionlab commented 1 year ago

Hi all, happy new year to you too! @dewuem: I am glad this fix worked for you, and thanks for suggesting the issue with contig names. I also tried fixing the contig names; underscores did not work for me, but plainly numbering them like chr1, chr3, ... did. I am now at the genotyping stage (with vg). Thanks to both of you for all the discussion and the various suggestions.

@cgroza and @clemgoub : Thanks for developing such a great pipeline (with a singularity container)!!

Cheers, Gautam

clemgoub commented 1 year ago

Hello all and Happy New year!

I am so glad you could fix your issues! Your feedback is important to improve the pipeline documentation! Please let us know how things go and if you need further help!

Cheers,

Clément

clemgoub commented 1 year ago

copying here my message to @dewuem about the scaffold/contig/chromosome name issues:


hooo, I think bcftools doesn't like the contig headers. I found this at the end of the RepeatMasker log, which corresponds to the step where we annotate the VCF with the repeats:

[W::vcf_parse] Contig 'scaffold1,52763537,f1Z52763537' is not defined in the header. (Quick workaround: index the file with tabix.)
[E::bcf_hdr_parse_line] Could not parse the header line: "##contig=<ID=scaffold1,52763537,f1Z52763537>"
[E::vcf_parse] Could not add dummy header for contig 'scaffold1,52763537,f1Z52763537'

I think it fixes the first line by itself, but it might not like the header. To test, I would use only one contig (maybe the longest), rename it without commas, and then restart from the top of the pipeline (--assemblies, not --vcf) to be sure.

clemgoub commented 1 year ago

See also issue #12