Error processing multiple sample files

dnharry commented 9 months ago

Operating System

Windows 11

Other Linux

No response

Workflow Version

v0.3.5

Workflow Execution

EPI2ME Desktop application

EPI2ME Version

v5.1.3

CLI command run

No response

Workflow Execution - CLI Execution Profile

None

What happened?

I have fastq files in separate sample folders. A run specifying 2 folders completes successfully, but when I tried a batch run on 10 folders I received the error below. I reduced the number to 5 but still received the same error. Please help, thank you!

Relevant log output

ERROR ~ Error executing process > 'pipeline:makeReport'
Caused by:
  Process `pipeline:makeReport` terminated with an error exit status (1)
Command executed:
  workflow-glue report         --report-fname wf-amplicon-report.html         --data data         --reference reference.fasta                  --versions versions.txt         --params params.json
Command exit status:
  1
Command output:
  (empty)
Command error:
  [09:45:10 - matplotlib.font_manager] generated new fontManager
  /home/epi2melabs/conda/lib/python3.8/site-packages/si_prefix/__init__.py:44: DeprecationWarning: invalid escape sequence \s
    u'(?P<si_unit>[%s])?\s*' % SI_PREFIX_UNITS)
  /home/epi2melabs/conda/lib/python3.8/site-packages/si_prefix/__init__.py:249: DeprecationWarning: invalid escape sequence \s
    u'(?P<si_unit>[%s])?\s*$' % SI_PREFIX_UNITS)
  [09:45:11 - workflow_glue] Starting entrypoint.
  [E::idx_find_and_load] Could not retrieve index file for 'data/barcode21/medaka.annotated.vcf.gz'
  [E::idx_find_and_load] Could not retrieve index file for 'data/barcode28/medaka.annotated.vcf.gz'
  [E::idx_find_and_load] Could not retrieve index file for 'data/barcode29/medaka.annotated.vcf.gz'
  [E::idx_find_and_load] Could not retrieve index file for 'data/barcode30/medaka.annotated.vcf.gz'
  [E::idx_find_and_load] Could not retrieve index file for 'data/barcode31/medaka.annotated.vcf.gz'
  Traceback (most recent call last):
    File "/mnt/c/Users/Atajera_/epi2melabs/workflows/epi2me-labs/wf-amplicon/bin/workflow-glue", line 7, in <module>
      cli()
    File "/mnt/c/Users/Atajera_/epi2melabs/workflows/epi2me-labs/wf-amplicon/bin/workflow_glue/__init__.py", line 72, in cli
      args.func(args)
    File "/mnt/c/Users/Atajera_/epi2melabs/workflows/epi2me-labs/wf-amplicon/bin/workflow_glue/report.py", line 88, in main
      [util.ReportDataSet(d) for d in args.data.glob("*")],
    File "/mnt/c/Users/Atajera_/epi2melabs/workflows/epi2me-labs/wf-amplicon/bin/workflow_glue/report.py", line 88, in <listcomp>
      [util.ReportDataSet(d) for d in args.data.glob("*")],
    File "/mnt/c/Users/Atajera_/epi2melabs/workflows/epi2me-labs/wf-amplicon/bin/workflow_glue/report_util.py", line 80, in __init__
      (entry.info["SR"][2] + entry.info["SR"][3])
  ZeroDivisionError: division by zero
Work dir:
  /mnt/c/Users/Atajera_/epi2melabs/instances/wf-amplicon_01HC21YGHFCZ8X2RP1399VA2SP/work/c2/260a4423e104495b8e9df40601ae04
Tip: you can replicate the issue by changing to the process work dir and entering the command `bash .command.run`
 -- Check '/mnt/c/Users/Atajera_/epi2melabs/instances/wf-amplicon_01HC21YGHFCZ8X2RP1399VA2SP/nextflow.log' file for details

Application activity log entry

None

julibeg commented 9 months ago

Hi @dnharry!

It looks like one of these VCF files has a variant entry with DP=0 which is causing the division by zero error. Could you inspect the VCF files in /mnt/c/Users/Atajera_/epi2melabs/instances/wf-amplicon_01HC21YGHFCZ8X2RP1399VA2SP/work/c2/260a4423e104495b8e9df40601ae04/data?

In case you're unfamiliar with WSL, please let me know and I will send detailed instructions on how to do this.
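
To illustrate the failure mode described above, here is a minimal sketch of a guarded division. The traceback does not show which quantity the workflow divides by, so the `safe_ratio` helper and the `info` values below are hypothetical and are not the code in `report_util.py`.

```python
from typing import Optional


def safe_ratio(numerator: float, denominator: float) -> Optional[float]:
    """Return numerator / denominator, or None when the denominator is zero."""
    if denominator == 0:
        return None
    return numerator / denominator


# Hypothetical INFO values for one variant record (not taken from the real VCF).
info = {"DP": 0, "SR": (0, 0, 0, 0)}
alt_reads = info["SR"][2] + info["SR"][3]

# Plain division would raise ZeroDivisionError here; the guard returns None instead.
fraction = safe_ratio(alt_reads, info["DP"])
print(fraction)
```

Returning `None` rather than `0.0` keeps "no coverage" distinguishable from "no supporting reads", which is one possible design choice among several.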

dnharry commented 9 months ago

Thank you for the response

There is no VCF file. Attached are the files.

julibeg commented 9 months ago

Hmm, the directory in the screenshot is not the working directory of the failed process (.../work/2c/aaf... instead of .../work/c2/260...). Perhaps you swapped the letters of c2 to 2c when navigating there?

In any case, please make sure to go to /mnt/c/Users/Atajera_/epi2melabs/instances/wf-amplicon_01HC21YGHFCZ8X2RP1399VA2SP/work/c2/260a4423e104495b8e9df40601ae04/data and there you should find VCF files in the different barcode directories.

dnharry commented 9 months ago

Sorry, my bad.

julibeg commented 9 months ago

Nice, this looks like the right location. These are symlinks pointing to the directories containing the files needed for creating the report. I'm afraid you won't be able to follow the symlink in the Windows File Explorer though (I didn't think about this earlier). Instead, please follow the steps below:

Open Powershell (or the Command Prompt if you prefer) and type wsl -d epi2me. This should drop you in the Linux shell of the EPI2ME WSL distribution that was installed alongside the EPI2ME desktop app.

Now, move to the work directory of the failed process by pasting

cd /mnt/c/Users/Atajera_/epi2melabs/instances/wf-amplicon_01HC21YGHFCZ8X2RP1399VA2SP/work/c2/260a4423e104495b8e9df40601ae04/data

(note the use of the cd command to change the directory).

Then, run

ls | xargs -i bash -c 'cp {}/*vcf.gz {}.vcf.gz'

to copy all VCFs from the barcode subdirs into the current dir (while also renaming them).

Next, we can make a new subdirectory, place all VCFs in it, and package it into a neat tarball so that you'll only have to upload a single file. Run the below

mkdir vcfs && mv *vcf.gz vcfs && tar czf vcfs.tar.gz vcfs

With a little luck the above all worked and if you run ls now, you should see the barcode symlinks and a file named vcfs.tar.gz. If that's the case, open the File Explorer with explorer.exe . (don't forget the .) and then please upload this file here (as long as it's not too large hopefully).

dnharry commented 9 months ago

Right! Please find below vcfs.tar.gz

julibeg commented 9 months ago

Actually, there is no need to upload files as it only takes one extra command to check them yourself. Could you please follow the steps below and let me know what you find.

Open Powershell (or Command Prompt) again, type wsl -d epi2me, and go to the directory with the VCFs you created (if you deleted the directory, just re-run the steps from the previous post above).
```
cd /mnt/c/Users/Atajera_/epi2melabs/instances/wf-amplicon_01HC21YGHFCZ8X2RP1399VA2SP/work/c2/260a4423e104495b8e9df40601ae04/data/vcfs
```
Then check for variants with DP=0:
```
zgrep 'DP=0' *
```
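
For anyone who prefers Python over `zgrep`, the same check can be sketched as below. This is only an illustration: the function name `zero_depth_records` is made up here, and it assumes the files are standard gzip- or bgzip-compressed VCFs with INFO in the eighth tab-separated column. Unlike the plain `zgrep`, it only matches `DP=0` as a complete INFO entry.

```python
import gzip
import re
from pathlib import Path
from typing import Iterator, Tuple

# Matches DP=0 only as a complete INFO key/value pair (not DP=01 or ADP=0).
ZERO_DEPTH = re.compile(r"(?:^|;)DP=0(?:;|$)")


def zero_depth_records(vcf_dir: Path) -> Iterator[Tuple[str, str]]:
    """Yield (file name, record line) for every record whose INFO has DP=0."""
    for vcf in sorted(vcf_dir.glob("*.vcf.gz")):
        with gzip.open(vcf, "rt") as handle:
            for line in handle:
                if line.startswith("#"):
                    continue  # skip header lines
                fields = line.rstrip("\n").split("\t")
                # INFO is the eighth tab-separated VCF column.
                if len(fields) > 7 and ZERO_DEPTH.search(fields[7]):
                    yield vcf.name, line.rstrip("\n")
```

Running `list(zero_depth_records(Path(".")))` from inside the `vcfs` directory would list the affected files together with the offending records.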

dnharry commented 9 months ago

Right! Found it,

What do you suggest I do to avoid this issue?

julibeg commented 9 months ago

We will add a filter to remove such variants in the next release which should go out early next week. Unless you desperately need the results for barcode31 until then, I think the easiest workaround for you is to just drop this barcode from the input and run the wf only on the other barcodes for now.
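
As a rough idea of what such a filter could look like (this is a sketch only, not the filter that ships in the workflow, and the name `drop_zero_depth` is hypothetical), the function below passes header lines through and drops records whose INFO field reports DP=0.

```python
from typing import Iterable, Iterator


def drop_zero_depth(lines: Iterable[str]) -> Iterator[str]:
    """Yield VCF header lines unchanged and records only if INFO DP is not 0."""
    for line in lines:
        if line.startswith("#"):
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        info = fields[7] if len(fields) > 7 else ""
        # Parse "KEY=VALUE;FLAG;..." into a dict; flags map to an empty string.
        pairs = dict(item.partition("=")[::2] for item in info.split(";") if item)
        if pairs.get("DP") != "0":
            yield line
```

It works on any iterable of text lines, so it could be fed from an open uncompressed VCF or from `gzip.open(path, "rt")`.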

dnharry commented 9 months ago

Sure, thank you very much!

julibeg commented 9 months ago

Looks like this is caused by an issue with how variants are annotated in Medaka. @dnharry, could you share the reads of barcode31 with us so that we can reproduce the problem and fix it upstream? Thanks!

dnharry commented 9 months ago

I sure can, but it consists of multiple fastq files which sum to ~70 MB. The link below is to a zip archive of the fastq.gz files: https://drive.google.com/file/d/1K6RnuiihwN3u0f2p8RaHbhSpk9wWMs8_/view?usp=drive_link

dnharry commented 9 months ago

I have also noticed that the consensus file, medaka.consensus.fasta, contains the reference sequences instead of the sample's.

julibeg commented 9 months ago

> I have also noticed that the consensus file, medaka.consensus.fasta, contains the reference sequences instead of the sample's.

Does this mean that the report shows variants for a sample which are not reflected in the consensus? The consensus is generated by incorporating the variants found by medaka into the provided reference. So overall, the sequences should look fairly similar depending on how many variants were found.

cjw85 commented 9 months ago

Hi @dnharry,

I'm the lead developer of medaka, amongst other things. The Google Drive link you posted requires you to grant permission; I have requested access.

dnharry commented 9 months ago

Oh I have granted access

dnharry commented 9 months ago

Oh okay, I see. I provided the whole gene instead of the portion of the gene that was amplified as the reference.

I will have to feed in the exact region then.

julibeg commented 9 months ago

Hi @dnharry, To reproduce the problem, we would also need the reference you used (sorry, I forgot to mention this explicitly earlier).

However, it might actually be easier if you just share the input files that went into medaka directly. To find them, please open Powershell again, run wsl -d epi2me, and navigate to the makeReport work dir as before

cd /mnt/c/Users/Atajera_/epi2melabs/instances/wf-amplicon_01HC21YGHFCZ8X2RP1399VA2SP/work/c2/260a4423e104495b8e9df40601ae04/data

Nextflow relies on symlinks to make input files available to the relevant processes. We need to read the link of the VCF to find out where it was generated. The below reads the link, gets the parent directory, and puts it into a .tar.gz archive (notice the added h option to tell tar to dereference links and the -C to tell it to change directories to avoid absolute paths in the archive).

tar czfh medaka-variant-inputs.tar.gz -C "$(dirname "$(readlink barcode31/medaka.annotated.vcf.gz)")" .

Then, if you run explorer.exe . again, you should see a file called medaka-variant-inputs.tar.gz. This should also be a lot smaller since the workflow downsamples reads before running medaka. Please upload this file here. Many thanks!
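
The readlink-then-archive step above can also be sketched in Python, in case the one-liner is hard to follow. This assumes a POSIX environment such as the WSL shell; `archive_link_parent` is a made-up helper name, and the archive should be written somewhere outside the directory being packed.

```python
import os
import tarfile
from pathlib import Path


def archive_link_parent(link: Path, archive: Path) -> Path:
    """Archive the directory holding the target of `link`; return that directory."""
    # os.readlink returns the stored target, which may be relative to the link's dir.
    target = Path(os.readlink(link))
    if not target.is_absolute():
        target = link.parent / target
    parent = target.parent
    # dereference=True stores the files that symlinks point to (like tar's h option);
    # arcname="." keeps paths in the archive relative (like tar's -C <dir> .).
    with tarfile.open(archive, "w:gz", dereference=True) as tar:
        tar.add(parent, arcname=".")
    return parent
```

For the case in this thread, the call would be along the lines of `archive_link_parent(Path("barcode31/medaka.annotated.vcf.gz"), Path("medaka-variant-inputs.tar.gz"))`, run from the `data` directory.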

dnharry commented 9 months ago

Oh, no worries. Done, but the size is ~40 MB, which exceeds GitHub's 25 MB limit. So here goes: https://drive.google.com/file/d/1jFRB69GV0tLfjtHSL0D3fbpxrH98tLZg/view?usp=sharing

cjw85 commented 8 months ago

Hi @dnharry,

This issue will be resolved in the next release, to appear shortly.

dnharry commented 8 months ago

> Hi @dnharry,
>
> This issue will be resolved in the next release, to appear shortly.

Thank you!

julibeg commented 8 months ago

Hi @dnharry, the new release (v0.4.1) should fix this. If the problem persists, please let us know and re-open this issue.

epi2me-labs / wf-amplicon