DerrickWood / kraken2

The second version of the Kraken taxonomic sequence classification system
MIT License
686 stars 267 forks source link

How to resume a "masking low-complexity regions of downloaded library": add --continue options to resume process #660

Open MaryoHg opened 1 year ago

MaryoHg commented 1 year ago

Dear @jenniferlu717 and @DerrickWood

Is there any way to resume a nearly-finished process for --download-library? I was downloading the "bacteria" library (the biggest one, nearly 140 GB) and I got broken pipe.

$ kraken2-build --download-library bacteria --db ./bigDB --threads 24

however, the connection was lost: sent 4113433 bytes received 42575105677 bytes 5103892.01 bytes/sec total size is 42557972622 speedup is 1.00 Rsync file transfer complete. Step 2/2: Assigning taxonomic IDs to sequences Processed 35020 projects (83305 sequences, 146.14 Gbp)... done. All files processed, cleaning up extra sequence files... done, library complete. Masking low-complexity regions of downloaded library...client_loop: send disconnect: Broken pipe

How can I resume this process? It took me (many) hours to accomplish the download (internet connection is terrible now)

Perhaps adding a --continue (as megahit assembler) or --resume option to the scripts?

I manually change download_from_ncbi.pl script to rsync --progress --partial --verbose however, the process can not be resumed to avoid downloading all genomes (AFAIK).

Thanks in advance for your time and for developing and maintaining kraken2. regards, Maryo.

MaryoHg commented 1 year ago

So, instead of running again the kraken2-build --download-library bacteria --threads 24 command, I rather just do what's needed: ¡Run dustmasker on library.fna'.

I found thread #449, and decided masking low complexity regions on the library downloaded, i.e., library.fna file into the bigDB/library/bacteria/ folder, running the following code:

$ dustmasker -in ./library.fna -outfmt fasta | sed -e '/^>/!s/[a-z]/x/g' > library.fna.tmp $ mv library.fna.tmp library.fna

I noticed that took nearly 30 minutes to create a masked file with 30GB in size. The original file is nearly 140, so it will take nearly ~4 hours to mask the total library.

I appreciate if you shine a light on me on this matter! Thanks. Maryo.

aimirza commented 3 months ago

My build process also broke during the masking process for the nt database. Please advise how we can properly continue the masking processes without starting over again.

ch4rr0 commented 3 months ago

Hello,

I ran into a similar issue not too long ago while building NT. I have augmented k2mask with the ability to resume a failed masking attempt. This process is not 100% fullproof though. A failed masked attempt will create a masked_sequences.txt file that the k2mask will reference the next time the masking process begins. It is possible that the process can get terminated before a masked sequence got completely written to disk and therefore before the sequence name gets written to masked_sequences.txt. It it always advised to review the file and delete the sequence whose name is not present in masked_sequences.txt before resuming the process.

Here's what the output looks like when resuming a mask of the human library:

./k2mask -in ../scripts/foo/library/human/library.fna -o library.fna -threads 4
Sequence: >kraken:taxid|9606|NC_000001.11 Homo sapiens chromosome 1, GRCh38.p14 Primary Assembly already masked... skipping.
Sequence: >kraken:taxid|9606|NT_187361.1 Homo sapiens chromosome 1 unlocalized genomic scaffold, GRCh38.p14 Primary Assembly HSCHR1_CTG1_UNLOCALIZED already masked... skipping.
Sequence: >kraken:taxid|9606|NT_187362.1 Homo sapiens chromosome 1 unlocalized genomic scaffold, GRCh38.p14 Primary Assembly HSCHR1_CTG2_UNLOCALIZED already masked... skipping.
Sequence: >kraken:taxid|9606|NT_187363.1 Homo sapiens chromosome 1 unlocalized genomic scaffold, GRCh38.p14 Primary Assembly HSCHR1_CTG3_UNLOCALIZED already masked... skipping.
Sequence: >kraken:taxid|9606|NT_187364.1 Homo sapiens chromosome 1 unlocalized genomic scaffold, GRCh38.p14 Primary Assembly HSCHR1_CTG4_UNLOCALIZED already masked... skipping.
Sequence: >kraken:taxid|9606|NT_187365.1 Homo sapiens chromosome 1 unlocalized genomic scaffold, GRCh38.p14 Primary Assembly HSCHR1_CTG5_UNLOCALIZED already masked... skipping.
Sequence: >kraken:taxid|9606|NT_187366.1 Homo sapiens chromosome 1 unlocalized genomic scaffold, GRCh38.p14 Primary Assembly HSCHR1_CTG6_UNLOCALIZED already masked... skipping.
Sequence: >kraken:taxid|9606|NT_187367.1 Homo sapiens chromosome 1 unlocalized genomic scaffold, GRCh38.p14 Primary Assembly HSCHR1_CTG7_UNLOCALIZED already masked... skipping.
Sequence: >kraken:taxid|9606|NT_187368.1 Homo sapiens chromosome 1 unlocalized genomic scaffold, GRCh38.p14 Primary Assembly HSCHR1_CTG8_UNLOCALIZED already masked... skipping.
Sequence: >kraken:taxid|9606|NT_187369.1 Homo sapiens chromosome 1 unlocalized genomic scaffold, GRCh38.p14 Primary Assembly HSCHR1_CTG9_UNLOCALIZED already masked... skipping.
Sequence: >kraken:taxid|9606|NC_000002.12 Homo sapiens chromosome 2, GRCh38.p14 Primary Assembly already masked... skipping.
Sequence: >kraken:taxid|9606|NT_187370.1 Homo sapiens chromosome 2 unlocalized genomic scaffold, GRCh38.p14 Primary Assembly HSCHR2_RANDOM_CTG1 already masked... skipping.

I will add, at a later time, functionality to the wrapper scripts to deal with the issue highlighted above, as I imagine it can be quite time consuming to remove a single sequence at the tail of a multi-gigabyte file.

FYI -- k2mask unlike dustmasker can run multi-threaded while maintaining order. This should help in reducing the masking time significantly.

aimirza commented 3 months ago

I wrote a script that solved the problem without starting over.

This example uses library.fna as the original fasta library and library.fna.tmp as the incomplete masked file produces from kraken2 (actually from dustmasker). Replace these file names with the actual files.

# Step 1: Save the last header from the tmp file
last_header=$(tac library.fna.tmp | grep '^>' -m1) 
echo "Found last header"

# Step 2: Remove the header and everything after in place, just in case dustmasker didnt finish masking the last sequence. 
sed -i "/$last_header/,\$d" library.fna.tmp
echo "Removed the header and everything after in place"

# Step 3: Use seqkit to split the original fasta file into 20 parts after and including the last header
awk -v header="$last_header" '$0 == header,EOF' library.fna | seqkit split2 --by-part 20 -O splits
echo "Split complete"

# Step 4: Mask each split file
for split_fa in splits/*.fasta; do
  dustmasker -in "$split_fa" -outfmt fasta | sed -e '/^>/!s/[a-z]/x/g' > "${split_fa}.tmp"
done

Step 4 Alternative: Running on SLURM. Instead, you can run these on SLURM in parallel with a limit of, for example, 10 jobs simultaneously. For example, in the same script you can use:

echo "starting masking jobs"
#!/bin/bash
#SBATCH --job-name=mask-splits
#SBATCH --array=0-19%10
#SBATCH --mem=8G
#SBATCH --cpus-per-task=1
#SBATCH --time=04:00:00
#SBATCH --output=process-splits-%A_%a.out
#SBATCH --error=process-splits-%A_%a.err

# Activate the environment
conda activate kraken2

# Calculate the split file name based on the SLURM_ARRAY_TASK_ID
SPLIT_FILES=(splits/*.fasta)
SPLIT_FILE=${SPLIT_FILES[$SLURM_ARRAY_TASK_ID]}

# Check if the split file exists
if [[ -f "$SPLIT_FILE" ]]; then
  # Mask the split file
  dustmasker -in "$SPLIT_FILE" -outfmt fasta | sed -e '/^>/!s/[a-z]/x/g' > "${SPLIT_FILE}.tmp"
else
  echo "Split file does not exist: $SPLIT_FILE"
fi

conda deactivate

Step 5 (To be run after all Slurm jobs are finished): Append all processed files to the original tmp file.

cat splits/*.fasta.tmp >> library.fna.tmp
mv library.fna.tmp library.fna # carefull with this step. You may want to change the name of the original library.fna
touch library.fna.masked # Empty status file