flass / pantagruel

a pipeline for reconciliation of phylogenetic histories within a bacterial pangenome
GNU General Public License v3.0

Permission denied to delete assembly files after failed fetch #47

Closed ezherman closed 2 years ago

ezherman commented 2 years ago

Hi Florent,

I was hoping to get your advice on issues I'm having when downloading assemblies using the -L flag. There are two issues:

I initiated within a Singularity container using the following command:

pantagruel -d pseudomonas_test_db -r ./ -f PANTAGFAM -I elh605 -L refseq_test_list init

The file refseq_test_list contains:

GCF_013201115.1
GCF_000981825.1
GCF_002208645.1
GCF_002085605.1
GCF_009676765.1

I then run the full workflow:

pantagruel -i pseudomonas_test_db/environ_pantagruel_pseudomonas_test_db.sh -N 4 all

The download then fails with GCF_002208645.1:

This is Pantagruel pipeline version 5ee7a269e214dff585cc5cfa97e970a1ffc38a0c using source code from repository '/pantagruel' (branch: 'master')

will try and execute processes in parallel with the following number of threads: 4
# will run tasks: 0 1 2 3 4 5 6 7 8 9
[2022-02-17 15:24:16] Pantagruel pipeline task 0: fetch public genome data from NCBI sequence databases and annotate private genomes.
Create new task folder '/mnt/lustre/users/elh605/pantagruel_pipeline/pseudomonas_test_db/00.input_data'
[2022-02-17 15:24:17] did not find the relevant NCBI Taxonomy flat files in '/mnt/lustre/users/elh605/pantagruel_pipeline/NCBI/Taxonomy_2022-02-17/'; download the from NCBI Taxonomy FTP
succesfully downloaded  NCBI Taxonomy flat files in '/mnt/lustre/users/elh605/pantagruel_pipeline/NCBI/Taxonomy_2022-02-17/'
taxcat.tar.gz: FAILED
md5sum: WARNING: 1 computed checksum did NOT match
taxdump.tar.gz: FAILED
md5sum: WARNING: 1 computed checksum did NOT match
/users/elh605/scratch/pantagruel_pipeline
[2022-02-17 15:24:56] fetch assembly data from NCBI FTP accordng to list '/mnt/lustre/users/elh605/pantagruel_pipeline/RefSeq_accession_ids'
# input Assembly accession id list: '/mnt/lustre/users/elh605/pantagruel_pipeline/RefSeq_accession_ids'
# output folder: '/mnt/lustre/users/elh605/pantagruel_pipeline/pseudomonas_test_db/00.input_data/RefSeq_accession_ids_assemblies_from_ftp'
fetch /genomes/all/GCF/013/201/115/GCF_013201115.1_ASM1320111v1

Total: 5 directories, 20 files, 2 symlinks                                                                                                        
New: 20 files, 2 symlinks
19910438 bytes transferred in 77 seconds (250.9 KiB/s)
total 4108
-r--r--r-- 1 elh605 clusterusers    1382 Jun  1  2020 GCF_013201115.1_ASM1320111v1_assembly_report.txt
-r--r--r-- 1 elh605 clusterusers    5624 Aug 11  2021 GCF_013201115.1_ASM1320111v1_assembly_stats.txt
dr-xr-xr-x 3 elh605 clusterusers    4096 Aug 11  2021 GCF_013201115.1_ASM1320111v1_assembly_structure
-r--r--r-- 1 elh605 clusterusers 2297770 Aug 11  2021 GCF_013201115.1_ASM1320111v1_cds_from_genomic.fna.gz
-r--r--r-- 1 elh605 clusterusers     276 Jun 26  2021 GCF_013201115.1_ASM1320111v1_feature_count.txt.gz
-r--r--r-- 1 elh605 clusterusers  325637 Aug 11  2021 GCF_013201115.1_ASM1320111v1_feature_table.txt.gz
-r--r--r-- 1 elh605 clusterusers 2171762 Jun  1  2020 GCF_013201115.1_ASM1320111v1_genomic.fna.gz
-r--r--r-- 1 elh605 clusterusers 5258481 Aug 11  2021 GCF_013201115.1_ASM1320111v1_genomic.gbff.gz
-r--r--r-- 1 elh605 clusterusers  473973 Aug 11  2021 GCF_013201115.1_ASM1320111v1_genomic.gff.gz
-r--r--r-- 1 elh605 clusterusers  572969 Aug 11  2021 GCF_013201115.1_ASM1320111v1_genomic.gtf.gz
-r--r--r-- 1 elh605 clusterusers 1418291 Aug 11  2021 GCF_013201115.1_ASM1320111v1_protein.faa.gz
-r--r--r-- 1 elh605 clusterusers 3523055 Aug 11  2021 GCF_013201115.1_ASM1320111v1_protein.gpff.gz
-r--r--r-- 1 elh605 clusterusers    5661 Jun 26  2021 GCF_013201115.1_ASM1320111v1_rna_from_genomic.fna.gz
-r--r--r-- 1 elh605 clusterusers 1681289 Aug 11  2021 GCF_013201115.1_ASM1320111v1_translated_cds.faa.gz
lrwxrwxrwx 1 elh605 clusterusers      25 Feb 17 15:25 README.txt -> ../../../../../README.txt
-r--r--r-- 1 elh605 clusterusers     410 Aug 11  2021 annotation_hashes.txt
-r--r--r-- 1 elh605 clusterusers      14 Feb 14 07:38 assembly_status.txt
-r--r--r-- 1 elh605 clusterusers    1683 Aug 11  2021 md5checksums.txt
./annotation_hashes.txt: OK
./GCF_013201115.1_ASM1320111v1_assembly_report.txt: OK
./GCF_013201115.1_ASM1320111v1_assembly_stats.txt: OK
./GCF_013201115.1_ASM1320111v1_assembly_structure/Primary_Assembly/assembled_chromosomes/AGP/chr.comp.agp.gz: OK
./GCF_013201115.1_ASM1320111v1_assembly_structure/Primary_Assembly/assembled_chromosomes/chr2acc: OK
./GCF_013201115.1_ASM1320111v1_assembly_structure/Primary_Assembly/assembled_chromosomes/FASTA/chr.fna.gz: OK
./GCF_013201115.1_ASM1320111v1_assembly_structure/Primary_Assembly/component_localID2acc: OK
./GCF_013201115.1_ASM1320111v1_cds_from_genomic.fna.gz: OK
./GCF_013201115.1_ASM1320111v1_feature_count.txt.gz: OK
./GCF_013201115.1_ASM1320111v1_feature_table.txt.gz: OK
./GCF_013201115.1_ASM1320111v1_genomic.fna.gz: OK
./GCF_013201115.1_ASM1320111v1_genomic.gbff.gz: OK
./GCF_013201115.1_ASM1320111v1_genomic.gff.gz: OK
./GCF_013201115.1_ASM1320111v1_genomic.gtf.gz: OK
./GCF_013201115.1_ASM1320111v1_protein.faa.gz: OK
./GCF_013201115.1_ASM1320111v1_protein.gpff.gz: OK
./GCF_013201115.1_ASM1320111v1_rna_from_genomic.fna.gz: OK
./GCF_013201115.1_ASM1320111v1_translated_cds.faa.gz: OK
GCF_013201115.1: done

fetch /genomes/all/GCF/000/981/825/GCF_000981825.1_ASM98182v1

New: 16 files, 1 symlink                                                                                
17759814 bytes transferred in 14 seconds (1.23 MiB/s)
total 4104
-r--r--r-- 1 elh605 clusterusers    1291 Dec 10  2019 GCF_000981825.1_ASM98182v1_assembly_report.txt
-r--r--r-- 1 elh605 clusterusers    5210 Sep  8 03:29 GCF_000981825.1_ASM98182v1_assembly_stats.txt
-r--r--r-- 1 elh605 clusterusers 2293112 Nov 13 09:24 GCF_000981825.1_ASM98182v1_cds_from_genomic.fna.gz
-r--r--r-- 1 elh605 clusterusers     305 Nov 13 09:24 GCF_000981825.1_ASM98182v1_feature_count.txt.gz
-r--r--r-- 1 elh605 clusterusers  330801 Nov 13 09:24 GCF_000981825.1_ASM98182v1_feature_table.txt.gz
-r--r--r-- 1 elh605 clusterusers 2150165 Dec 18  2020 GCF_000981825.1_ASM98182v1_genomic.fna.gz
-r--r--r-- 1 elh605 clusterusers 5251664 Nov 13 09:24 GCF_000981825.1_ASM98182v1_genomic.gbff.gz
-r--r--r-- 1 elh605 clusterusers  482703 Nov 13 09:24 GCF_000981825.1_ASM98182v1_genomic.gff.gz
-r--r--r-- 1 elh605 clusterusers  580544 Nov 13 09:24 GCF_000981825.1_ASM98182v1_genomic.gtf.gz
-r--r--r-- 1 elh605 clusterusers 1418836 Nov 13 09:24 GCF_000981825.1_ASM98182v1_protein.faa.gz
-r--r--r-- 1 elh605 clusterusers 3556208 Nov 13 09:24 GCF_000981825.1_ASM98182v1_protein.gpff.gz
-r--r--r-- 1 elh605 clusterusers    5641 Dec 18  2020 GCF_000981825.1_ASM98182v1_rna_from_genomic.fna.gz
-r--r--r-- 1 elh605 clusterusers 1681790 Nov 13 09:24 GCF_000981825.1_ASM98182v1_translated_cds.faa.gz
lrwxrwxrwx 1 elh605 clusterusers      25 Feb 17 15:25 README.txt -> ../../../../../README.txt
-r--r--r-- 1 elh605 clusterusers     410 Nov 13 09:24 annotation_hashes.txt
-r--r--r-- 1 elh605 clusterusers      14 Feb 14 05:14 assembly_status.txt
-r--r--r-- 1 elh605 clusterusers    1120 Nov 13 09:24 md5checksums.txt
./annotation_hashes.txt: OK
./GCF_000981825.1_ASM98182v1_assembly_report.txt: OK
./GCF_000981825.1_ASM98182v1_assembly_stats.txt: OK
./GCF_000981825.1_ASM98182v1_cds_from_genomic.fna.gz: OK
./GCF_000981825.1_ASM98182v1_feature_count.txt.gz: OK
./GCF_000981825.1_ASM98182v1_feature_table.txt.gz: OK
./GCF_000981825.1_ASM98182v1_genomic.fna.gz: OK
./GCF_000981825.1_ASM98182v1_genomic.gbff.gz: OK
./GCF_000981825.1_ASM98182v1_genomic.gff.gz: OK
./GCF_000981825.1_ASM98182v1_genomic.gtf.gz: OK
./GCF_000981825.1_ASM98182v1_protein.faa.gz: OK
./GCF_000981825.1_ASM98182v1_protein.gpff.gz: OK
./GCF_000981825.1_ASM98182v1_rna_from_genomic.fna.gz: OK
./GCF_000981825.1_ASM98182v1_translated_cds.faa.gz: OK
GCF_000981825.1: done

fetch /genomes/all/GCF/002/208/645/GCF_002208645.1_ASM220864v1

New: 16 files, 1 symlink                                                                                  
19803820 bytes transferred in 14 seconds (1.36 MiB/s)
total 4104
-r--r--r-- 1 elh605 clusterusers    1379 Dec 10  2019 GCF_002208645.1_ASM220864v1_assembly_report.txt
-r--r--r-- 1 elh605 clusterusers    5587 Dec 20 22:32 GCF_002208645.1_ASM220864v1_assembly_stats.txt
-r--r--r-- 1 elh605 clusterusers 2257388 Dec 20 22:32 GCF_002208645.1_ASM220864v1_cds_from_genomic.fna.gz
-r--r--r-- 1 elh605 clusterusers     277 Dec 20 22:32 GCF_002208645.1_ASM220864v1_feature_count.txt.gz
-r--r--r-- 1 elh605 clusterusers  323071 Dec 20 22:32 GCF_002208645.1_ASM220864v1_feature_table.txt.gz
-r--r--r-- 1 elh605 clusterusers 2123073 Oct 14  2017 GCF_002208645.1_ASM220864v1_genomic.fna.gz
-r--r--r-- 1 elh605 clusterusers 7481811 Dec 20 22:32 GCF_002208645.1_ASM220864v1_genomic.gbff.gz
-r--r--r-- 1 elh605 clusterusers  471611 Dec 20 22:32 GCF_002208645.1_ASM220864v1_genomic.gff.gz
-r--r--r-- 1 elh605 clusterusers  570353 Dec 20 22:32 GCF_002208645.1_ASM220864v1_genomic.gtf.gz
-r--r--r-- 1 elh605 clusterusers 1402762 Dec 20 22:32 GCF_002208645.1_ASM220864v1_protein.faa.gz
-r--r--r-- 1 elh605 clusterusers 3504965 Dec 20 22:32 GCF_002208645.1_ASM220864v1_protein.gpff.gz
-r--r--r-- 1 elh605 clusterusers    5679 Dec 22  2020 GCF_002208645.1_ASM220864v1_rna_from_genomic.fna.gz
-r--r--r-- 1 elh605 clusterusers 1654307 Dec 20 22:32 GCF_002208645.1_ASM220864v1_translated_cds.faa.gz
lrwxrwxrwx 1 elh605 clusterusers      25 Feb 17 15:25 README.txt -> ../../../../../README.txt
-r--r--r-- 1 elh605 clusterusers     410 Dec 20 22:32 annotation_hashes.txt
-r--r--r-- 1 elh605 clusterusers      14 Feb 14 05:42 assembly_status.txt
-r--r--r-- 1 elh605 clusterusers    1133 Dec 20 22:32 md5checksums.txt
./annotation_hashes.txt: OK
./GCF_002208645.1_ASM220864v1_assembly_report.txt: OK
./GCF_002208645.1_ASM220864v1_assembly_stats.txt: OK
./GCF_002208645.1_ASM220864v1_cds_from_genomic.fna.gz: OK
./GCF_002208645.1_ASM220864v1_feature_count.txt.gz: OK
./GCF_002208645.1_ASM220864v1_feature_table.txt.gz: OK
./GCF_002208645.1_ASM220864v1_genomic.fna.gz: OK
./GCF_002208645.1_ASM220864v1_genomic.gbff.gz: FAILED
./GCF_002208645.1_ASM220864v1_genomic.gff.gz: OK
./GCF_002208645.1_ASM220864v1_genomic.gtf.gz: OK
./GCF_002208645.1_ASM220864v1_protein.faa.gz: OK
./GCF_002208645.1_ASM220864v1_protein.gpff.gz: OK
./GCF_002208645.1_ASM220864v1_rna_from_genomic.fna.gz: OK
./GCF_002208645.1_ASM220864v1_translated_cds.faa.gz: OK
md5sum: WARNING: 1 computed checksum did NOT match
Error: files in /GCF_002208645.1_ASM220864v1/ seem corrupted (not only about missing *assembly_structure/ files)
exit now
ERROR: could not fetch all the genomes ; exit now
ERROR: Pantagruel pipeline task 0: failed.
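
The `OK` / `FAILED` lines in the log look like the output of `md5sum -c` run against the `md5checksums.txt` file that ships with each assembly, although the thread does not confirm that this is exactly what the pipeline does. Under that assumption, the check can be repeated by hand on a single assembly folder to see which files fail. The sketch below uses a throwaway folder and hypothetical file names, not the real assembly:

```shell
# Build a tiny stand-in for a downloaded assembly folder (hypothetical files).
asm=$(mktemp -d)
printf 'ACGT\n'  > "$asm/genomic.fna"
printf 'LOCUS\n' > "$asm/genomic.gbff"
(cd "$asm" && md5sum ./genomic.fna ./genomic.gbff > md5checksums.txt)

# Simulate a corrupted or truncated download of one file.
printf 'truncated' > "$asm/genomic.gbff"

# Re-run the check and keep only the entries that did not verify.
# '|| true' keeps the script going even when nothing or everything fails.
failed=$(cd "$asm" && md5sum -c md5checksums.txt 2>/dev/null | grep -v ': OK$' || true)
echo "$failed"
```

On a real download, the same two commands (`cd` into the assembly folder, then `md5sum -c md5checksums.txt`) would be run directly; a file that fails on one attempt and passes after a fresh download points to a transfer problem rather than a problem with the source file.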

Following this, I am unable to delete the pseudomonas_test_db folder using rm -rf pseudomonas_test_db/:

rm: cannot remove 'pseudomonas_test_db/00.input_data/RefSeq_accession_ids_assemblies_from_ftp/GCF_013201115.1_ASM1320111v1/GCF_013201115.1_ASM1320111v1_assembly_structure/README.txt': Permission denied
rm: cannot remove 'pseudomonas_test_db/00.input_data/RefSeq_accession_ids_assemblies_from_ftp/GCF_013201115.1_ASM1320111v1/GCF_013201115.1_ASM1320111v1_assembly_structure/Primary_Assembly/assembled_chromosomes/AGP/chr.comp.agp.gz': Permission denied
rm: cannot remove 'pseudomonas_test_db/00.input_data/RefSeq_accession_ids_assemblies_from_ftp/GCF_013201115.1_ASM1320111v1/GCF_013201115.1_ASM1320111v1_assembly_structure/Primary_Assembly/assembled_chromosomes/chr2acc': Permission denied
rm: cannot remove 'pseudomonas_test_db/00.input_data/RefSeq_accession_ids_assemblies_from_ftp/GCF_013201115.1_ASM1320111v1/GCF_013201115.1_ASM1320111v1_assembly_structure/Primary_Assembly/assembled_chromosomes/FASTA/chr.fna.gz': Permission denied
rm: cannot remove 'pseudomonas_test_db/00.input_data/RefSeq_accession_ids_assemblies_from_ftp/GCF_013201115.1_ASM1320111v1/GCF_013201115.1_ASM1320111v1_assembly_structure/Primary_Assembly/component_localID2acc': Permission denied

Giving myself write and execute permissions over the files and the pseudomonas_test_db directory did not solve the issue. I am able to delete the files on my personal computer with sudo; however, I cannot do this on my university's HPC cluster.
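
One detail in the listing above may be relevant: the `_assembly_structure` directory is shown as `dr-xr-xr-x`, and every path that `rm` refuses to delete sits inside it. Deleting a file requires write permission on the directory that contains it, not on the file itself, so a read-only directory is enough to produce `Permission denied`. The sketch below reproduces this in a scratch folder with hypothetical names and shows a recursive fix; it only helps if the files really are owned by the calling user, which the thread does not establish.

```shell
# Reproduce the situation in a scratch directory (hypothetical names).
demo=$(mktemp -d)
mkdir -p "$demo/db/assembly_structure"
touch "$demo/db/assembly_structure/chr2acc"
chmod -R a-w "$demo/db"   # read-only tree, like the mirrored FTP folders

# At this point 'rm -rf "$demo/db"' fails with 'Permission denied'
# for a non-root user, because the directories are not writable.

# Restore owner write permission on every directory and file, then delete.
chmod -R u+w "$demo/db"
rm -rf "$demo/db"
```

The `-R` matters: adding write permission only to the top-level directory and to the files leaves the nested directories read-only.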

I was wondering whether:

Thanks in advance!

flass commented 2 years ago

Hi Ezra,

thank you for reporting this bug.

It seems some of the issues you're getting (write permissions, potentially unstable network connectivity leading to failed downloads) could be linked to the container implementation. Unfortunately, I have never used pantagruel from a Singularity container, or at least not this version. The Dockerfile was designed for building a Docker image and has only been tested as a Docker container. I understand that Docker may not be supported on your computer cluster for security reasons; in fact, the Docker image installation option was made mostly for use on a cloud. That said, our IT team at the Sanger has made a Singularity image for pantagruel, which (I think) was working correctly last time I checked. I could try and see what they changed in the Dockerfile, or what options they used when building the image, to make it work as a Singularity container.

Could you please share the commands you used to build the image and to run the container?

With regard to the files being out of reach of your write permissions, there is probably something (either a folder or some file) whose permissions belong to the root user from the container (even though everything shows as belonging to you). If that is so, once the container exits, I'm not sure you can even re-access these files through another container, as the root users would technically be different users... That's me making hypotheses here, as I don't have deep knowledge of container systems, especially Singularity ones. I can only suggest two options:

I hope this helps you a bit.

Best wishes,

Florent

ezherman commented 2 years ago

Hi Florent,

Thank you for offering to check with the IT team how the Singularity container was implemented. That would be really helpful.

I am running Singularity 3.8.5 in a conda environment on Ubuntu 20.04 through Windows Subsystem for Linux. When using the Slurm HPC (University of York's Viking cluster), I am using Singularity 3.5.3, installed by the IT team.

I am unable to reproduce the failure to delete pseudomonas_test_db. Now when the fetching fails, the folder can be deleted with rm -rf. I am sorry that I can't provide you with more details on this issue!

I am however able to reproduce the failure to fetch assemblies. After running the pantagruel -i (...) all command a few times, it seems like the failure is not associated with a particular assembly. In fact, I even had a couple of fetches that completed without any issues. Could it be that a temporary network failure causes the fetch command to fail?
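
If the failures are indeed transient, one generic workaround is to wrap the command in a retry loop. The sketch below is a plain shell helper, not part of pantagruel; the function name `retry` and its arguments are made up for illustration, and whether re-running the pipeline resumes cleanly after a failed fetch is not verified here.

```shell
# retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS is reached.
retry() {
    max_attempts=$1
    delay=$2
    shift 2
    attempt=1
    until "$@"; do
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "giving up after $attempt attempts: $*" >&2
            return 1
        fi
        echo "attempt $attempt failed; retrying in ${delay}s" >&2
        attempt=$((attempt + 1))
        sleep "$delay"
    done
}

# Hypothetical usage with the command from this thread
# (up to 5 attempts, 60 seconds apart):
# retry 5 60 pantagruel -i pseudomonas_test_db/environ_pantagruel_pseudomonas_test_db.sh -N 4 all
```

A fixed delay keeps the helper simple; a longer or increasing delay may be kinder to the remote server if the failures come in bursts.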

I have now encountered a new issue: if fetching does complete, pantagruel freezes at task 2. Is this something you've encountered before? See the output below.

[2022-02-21 14:28:08] Pantagruel pipeline task 2: align homologous protein sequences and reverse-translate aligned proteins into aligned coding sequences.
Create new task folder '/mnt/c/Users/elh605/pantagruel_test_loc/pantagruel_pipeline/pseudomonas_test_db/02.gene_alignments'
[2022-02-21 14:28:13]-- step 1: alignment of nr protein families with clustalo and GNU parallel
Academic tradition requires you to cite works you base your article on.
When using programs that use GNU Parallel to process data for publication
please cite:

  O. Tange (2011): GNU Parallel - The Command-Line Power Tool,
  ;login: The USENIX Magazine, February 2011:42-47.

This helps funding further development; AND IT WON'T COST YOU A CENT.
If you pay 10000 EUR you should feel free to use GNU Parallel without citing.

To silence this citation notice: run 'parallel --citation'.

Here are the commands used to run the container and pantagruel:

mkdir pantagruel_pipeline
cd pantagruel_pipeline
git clone --recurse-submodules https://github.com/flass/pantagruel.git

singularity pull docker://flass/pantagruel-dep:master-latest
singularity run -B $PWD:$PWD pantagruel-dep_master-latest.sif

pantagruel/install_interproscan.sh $PWD/ $PWD/

# run with pseudomonas assemblies, which sometimes fetches without any issues
# however pantagruel freezes in step 1 of task 2 (alignment of nr protein families with clustalo and GNU parallel)
pantagruel -d pseudomonas_test_db -r ./ -f PANTAGFAM -I elh605 -L refseq_list_test init
pantagruel -i ./pseudomonas_test_db/environ_pantagruel_pseudomonas_test_db.sh all

flass commented 2 years ago

Hi Ezra,

OK - it's good news that the file permission issue is not occurring any more.

The file fetching problem is likely due to an unstable network connection, which in turn probably has to do with the interface between the Singularity container and the lftp software. It could be that some parameterisation of the singularity run or image building can solve it... I've tested the Singularity installation again on our cluster at Sanger and this step completes fine, so there might be a trick to make it work stably (it could also be something at the level of how the whole cluster is configured and how it allows Singularity containers to access the internet... maybe you can ask your IT admin team).

Regarding the block in task 2, that's a completely different bug. Not really a bug in fact, just the need to run the command parallel --citation once on the system to acknowledge the authors before you can use the parallel command freely. In a container environment this is a problem because every container is like a new system, so it has to be acknowledged every time a new container is created/run. It's something that came up in GNU parallel after I last built the Docker image, but your Singularity image has it. I've introduced automated acknowledgement in commits 1c47117 and 501e894 (master) and 8d8c10f (usingGeneRax). So if you re-build your Singularity image now with the latest code it should work fine - or if not please let me know, but ideally in a separate issue.
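
For anyone stuck on an older image, the acknowledgement can also be recorded by hand. A minimal sketch, assuming a GNU parallel version that checks for a `will-cite` marker file under `$HOME/.parallel` (this is not necessarily how the commits above implement it):

```shell
# Mark the GNU parallel citation notice as acknowledged for the current user.
# Assumption: this parallel version looks for $HOME/.parallel/will-cite.
mkdir -p "$HOME/.parallel"
touch "$HOME/.parallel/will-cite"
```

Singularity typically bind-mounts the host home directory into the container by default, in which case the marker created once on the host would persist across container runs; this depends on how the cluster configures Singularity. Creating the marker still implies the commitment to cite GNU parallel in publications.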

I consider this issue resolved; however, I'll try to document how to get a stable network connection using Singularity, and will post again here if I find a solution.

ezherman commented 2 years ago

Thank you Florent!