artic-network / artic-ncov2019

ARTIC nanopore protocol for nCoV2019 novel coronavirus
Creative Commons Attribution 4.0 International
168 stars 166 forks source link

nanopolish index still running after 24 hours #59

Open RichardCorbett opened 3 years ago

RichardCorbett commented 3 years ago

Hi all,

We are sequencing minION flowcells using the ARTIC amplicons and beginning to test out the ARTIC workflow as described on this page:

ARTIC SOP 2019

Our data are sequenced as a pool of 96 samples (including some negative and positive controls), basecalled with Guppy, and index split using the approach outlined in the SOP above.

Likely as expected, the guppyplex command is efficient and runs in just a minute or so on each barcode. This gives us 96 filtered fastq files with read counts ranging from 0 to 60,000.

When I run the minion command like this (one job per barcode): artic minion --normalise 200 --threads 12 --scheme-directory ~/bin/artic-ncov2019/primer_schemes/ --read-file plex_FAP90847_AAGGTTACACAAACCCTGGACAAG_pass_concat.fastq --fast5-directory fast5 --sequencing-summary sequencing_summary.txt nCoV-2019/V3 plex_FAP90847_AAGGTTACACAAACCCTGGACAAG_pass_concat

It fires up the jobs without error, but after the first few minutes the output doesn't seem to be updated though the jobs don't crash or finish. I can see that the nanopolish index command is running and registering on the CPUs of my server, and there doesn't appear to be any iowat on the nodes, suggesting there isn't a network bottleneck slowing things down.

Within the first couple of minutes of the nanopolish index commands firing up I get files like these:

-rw-r--r-- 1 rcorbett users    0 Mar 13 13:08 plex_FAP90847_ACCACTGCCATGTATCAAAGTACG_pass_concat.minion.log.txt
-rw-r--r-- 1 rcorbett users 7954 Mar 13 13:08 plex_FAP90847_ACCACTGCCATGTATCAAAGTACG_pass_concat.fastq.index
-rw-r--r-- 1 rcorbett users    7 Mar 13 13:08 plex_FAP90847_ACCACTGCCATGTATCAAAGTACG_pass_concat.fastq.index.gzi
drwxr-sr-x 2 rcorbett users    4 Mar 13 13:08 .
-rw-r--r-- 1 rcorbett users 2962 Mar 13 13:08 plex_FAP90847_ACCACTGCCATGTATCAAAGTACG_pass_concat.fastq.index.fai

The jobs have run for another 24 hours without any more output going to the files. However, I see that nanopolish index has been running on my nodes and using CPU throughout.

Screen Shot 2021-03-14 at 10 42 25 AM

Do you have any ideas how I should look in to what is slowing this down? I have samples that have been analyzing for over 24 hours - Is this expected? None of the samples that have gotten on the cluster have finished.

I have used nanopolish index on entire promethion flowcells outside of the ARTIC workflow and it runs in a few hours so I expected to be able to index the minIon data much faster. Perhaps running multiple samples in parallel is known to create issues with nanopolish index ?

thanks for all your insight!

will-rowe commented 3 years ago

This looks strange - I'm not entirely sure of the issue. Pinging @jts to see if he has any insights?

Is the command you are using to run nanopolish outside of the pipeline equivalent to the call within the pipeline?

One thing to change is to use fast5/barcode as the input to --fast5-directory, but this is unlikely to be the problem here.

jts commented 3 years ago

My best guess is that there is something about the sequencing_summary.txt file that nanopolish doesn't like. I suggest running the nanopolish index command outside of the artic wrappers to see what it prints to stderr, which might help diagnose what has gone wrong.

RichardCorbett commented 3 years ago

Thanks folks, Overnight last night I was able to run a different set of data smoothly and quickly. There must be something fishy with this one flowcell of data. I tried running just one fastq file with this command:

nanopolish index -d ../fast5/ -s ../sequencing_summary.txt -v ../plex_FAP90847_ACCACTGCCATGTATCAAAGTACG_pass_concat.fastq
[readdb] indexing ../fast5/

but I don't get any more output after that. I suspect there is some mismatch between the fastq/fast5/sequencing summary files I am using. I think I can dig in from here. Sorry for the trouble.

RichardCorbett commented 3 years ago

Hi folks,

It looks like our system is doing something possibly unique.

We are loading a minION flowcell, but extracting the fastq files and sequencing summary at a timepoint in the middle of a run. At this time guppy has finished basecalling available fast5 files, but will be restarted once new fast5 files are available.

The run continues and fast5 files are generated until someone decides to shut down the sequencing run.

When I get around to starting my analysis I have the following inputs: -fastq file from the intermediate timepoint -sequencing summary file from the intermediate timepoint -fast5 files from a later timepoint

This configuration gives me the hanging nanopore index commands.

If I change the sequencing_summary.txt file to be the one created after the run completes the minion command completes in a few minutes. Is it possible that nanopore index hangs when the seuqencing_summary.txt file contains a subset of the reads contained in the fast5 files?

jts commented 3 years ago

In this case nanopolish is going to revert to the slow indexing method for the subset of fast5s that haven't been basecalled. If the run progressed well this can mean opening and reading 1000s of files, which takes awhile.

To work around this, I suggest making a directory containing symlinks to the fast5s that are basecalled and present in the sequencing summary, then passing the directory-of-symlinks to nanopolish index. This is what we do to analyze an in-progress run.

RichardCorbett commented 3 years ago

Wonderful. Thanks @jts

RichardCorbett commented 3 years ago

Tried this out and indeed symlinking the fast5s referenced in the sequencing_summary.txt file allows the analysis to take just a minute or two per sample in the pool.