artic-network / rampart

Read Assignment, Mapping, and Phylogenetic Analysis in Real Time
GNU General Public License v3.0

Fastqs with the same name in different folders are excluded from analysis #73

Closed MaestSi closed 4 years ago

MaestSi commented 4 years ago

Hi, I am trying out rampart v1.1.0, starting from demultiplexed reads. I performed demultiplexing with guppy_barcoder using the --require_barcodes_both_ends option, as suggested here. I am using example_data, and these are the reads that survive demultiplexing:

barcode01: 3488
barcode03: 213
barcode04: 5251

I have modified the run_configuration.json file accordingly:

{
  "title": "EBOV Validation Run",
  "basecalledPath": "demultiplexing",
  "samples": [
    {
      "name": "Mayinga",
      "description": "",
      "barcodes": [ "barcode01" ]
    },
    {
      "name": "Negative Control",
      "description": "",
      "barcodes": [ "barcode02" ]
    },
    {
      "name": "Kikwit",
      "description": "",
      "barcodes": [ "barcode03" ]
    },
    {
      "name": "Makona",
      "description": "",
      "barcodes": [ "barcode04" ]
    }
  ]
}

When running rampart, I found that reads from the Kikwit strain are never loaded. Is this due to the small number of reads (213) for that sample, or am I doing something wrong? Moreover, could you please confirm that the barcode names specified in the run_configuration.json file should match the barcodes in the read headers, rather than the directory names? As a test, I tried renaming the directory barcode01 in the demultiplexing folder to, say, B1, and the results were the same. Thanks in advance, Simone

jameshadfield commented 4 years ago

Hey Simone -- for guppy (and porechop) demuxing, we parse the barcodes from the FASTQ contents, not the filename/dirname.

I would guess (but have no proof) that this is due to guppy keeping the kikwit fastq it is writing to open. Rampart will only read each fastq after it is closed, in order to avoid reading files which are still being written to. Do the other samples have accurate read counts, or are they (e.g.) multiples of 1000?
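To compare against the counts RAMPART reports, the reads per barcode can be tallied from the FASTQ headers themselves rather than from the directory names. The following is only a sketch: it assumes that guppy_barcoder writes a `barcode=<name>` field into each header line and that every record spans exactly four lines. The function name `count_barcodes_in_headers` is made up for this example and is not part of RAMPART or guppy.

```shell
#!/usr/bin/env bash
# Sketch: count reads per barcode by parsing FASTQ headers.
# Assumptions: headers contain a "barcode=<name>" field and every
# FASTQ record is exactly 4 lines (header, sequence, "+", quality).
count_barcodes_in_headers() {
  local dir=$1
  find "$dir" -type f -name '*.fastq' -exec cat {} + |
    awk 'NR % 4 == 1 {
           bc = "unclassified"              # used when no barcode= field is present
           for (i = 1; i <= NF; i++)
             if ($i ~ /^barcode=/) bc = substr($i, 9)
           n[bc]++
         }
         END { for (b in n) print b, n[b] }' |
    sort
}
```

Running `count_barcodes_in_headers demultiplexing` prints one line per barcode with its read count, which can then be checked against the numbers shown in the RAMPART interface.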

MaestSi commented 4 years ago

Hi, guppy_barcoder was run before rampart, so no fastq files should be open. Here is the number of reads processed by rampart for each barcode:

barcode01: 3488 (all of them)
barcode03: 0 (out of 213)
barcode04: 1251 (out of 5251)

So 4000 reads are missing for barcode04 as well. Rampart also prints some warnings:

[warning]  Detected "new" FASTQ fastq_runid_e9e588bddbea1984a1556c61a8d53decbecf82e2_0 which has already been seen!
[warning]  Detected "new" FASTQ fastq_runid_e9e588bddbea1984a1556c61a8d53decbecf82e2_0 which has already been seen!
[warning]  Detected "new" FASTQ fastq_runid_e9e588bddbea1984a1556c61a8d53decbecf82e2_0 which has already been seen!
[warning]  Detected "new" FASTQ fastq_runid_e9e588bddbea1984a1556c61a8d53decbecf82e2_1 which has already been seen!

So, my guess is that rampart doesn't like files having the same name, but being in different folders due to different barcodes. Since this is how guppy_barcoder names fastqs, the folder name should probably also be used when naming the csv files in the annotations folder.
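A quick way to check whether a demultiplexing folder is affected is to list the FASTQ file names that occur in more than one barcode directory. This is a minimal sketch, assuming the files end in `.fastq` and contain no newlines in their names; `find_duplicate_fastq_names` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: print every FASTQ basename that appears more than once
# anywhere under the given directory.
find_duplicate_fastq_names() {
  local dir=$1
  find "$dir" -type f -name '*.fastq' |
    awk -F/ '{ print $NF }' |   # keep only the file name, drop the directories
    sort | uniq -d              # uniq -d prints names that are repeated
}
```

If `find_duplicate_fastq_names demultiplexing` prints nothing, no two files share a name; any name it does print exists in at least two folders.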

jameshadfield commented 4 years ago

"my guess is that rampart doesn't like files having the same name, but being in different folders due to different barcodes"

Spot on. I'll include the full path in the list of seen files. Thanks for tracking this down!
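The difference between tracking seen files by name and by full path can be illustrated with a small sketch. This is not RAMPART's actual code (RAMPART is a Node.js application); `count_unique_keys` is a hypothetical helper that only shows how many distinct entries a "seen" list would hold under each keying scheme.

```shell
#!/usr/bin/env bash
# Sketch: count the distinct keys a "seen files" list would contain.
# Usage: count_unique_keys basename|path FILE...
count_unique_keys() {
  local mode=$1; shift
  local f
  for f in "$@"; do
    if [ "$mode" = basename ]; then
      basename "$f"          # key on the file name only
    else
      printf '%s\n' "$f"     # key on the full path
    fi
  done | sort -u | wc -l | tr -d ' '
}
```

For two files such as `barcode01/fastq_runid_x_0.fastq` and `barcode03/fastq_runid_x_0.fastq`, keying on the basename yields a single entry, so the second file looks as if it has already been seen, whereas keying on the full path keeps both.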

MaestSi commented 4 years ago

You're welcome!

jameshadfield commented 4 years ago

Hi @MaestSi -- if you have time, would you mind checking these data using rampart 1.2.0rc1 (the newest, pre-release version)? It should be working now. You can install this in your conda environment by running:

conda install artic-network/label/test::rampart==1.2.0rc1

MaestSi commented 4 years ago

Hi, I tried installing and running it but I got this error:

node: /home/simone/miniconda3/envs/artic-rampart/bin/../lib/libcrypto.so.1.1: version `OPENSSL_1_1_1b' not found (required by node)

MaestSi commented 4 years ago

Hi, I tried reinstalling the conda version from the master branch, and running rampart --help showed the same error. However, I then followed the "Install from source" instructions and it worked perfectly. The issue with fastqs sharing the same name also looks solved. Before, I used to rename them with:

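#!/bin/bash
# Prefix each FASTQ with its barcode directory name so that file names are unique.

demultiplexing_dir=$1

# Loop over the barcode directories (e.g. barcode01, barcode03).
for bc in "$demultiplexing_dir"/*barcode* ; do
  [ -d "$bc" ] || continue
  bc_id=$(basename "$bc")
  # Rename every FASTQ in this directory, e.g. x.fastq -> barcode01_x.fastq
  for f in "$bc"/*.fastq* ; do
    [ -e "$f" ] || continue
    mv "$f" "$bc/${bc_id}_$(basename "$f")"
  done
done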

but it looks like there is no need to do it anymore. Only the conda installation remains to be fixed. Thanks, Simone

jameshadfield commented 4 years ago

Thanks - this is good information to have. Will fix the conda install...

jameshadfield commented 4 years ago

This should be fixed in rampart v1.2.0, now available on conda. Please reopen if you have this issue again!