Open Malabady opened 2 years ago
Hi,
Do you have the correct environment setup? Doe sit work with -profile conda
?
I tried -profile Conda
first but it didn't work due to missing command mamba
. I am working on our computer cluster. Do you think using -profile conda
will solve the issue? the thing is, I am not sure what other issues might arise in that case.
It's the semicolon in -o 000000000_Scq3ke7_8492;HRSCAF=10725_np1212-239056-177420684_bundle_allONT-RNA_PassedReads.gff
It thinks it's the start of another command. This should quoted out at our end.
Can you remove semicolons form the sample_ids/filenames as a workaround for now until I can get a fix released?
I certainly can. Do I need to restart from the beginning? Can I use “—resume”?
Greetings,
also, would it cause any problems if I replace the ";" with "|" or "_"?
You can use -resume. And yes avoid "|" "/". "-" should be fine
Many thanks for your quick reply. I have another issue when using the denovo mode. Can I ask it here or better to open a new case so it is searchable or others?
Sorry, it failed again, see the following:
executor > local (5)
[6d/27bb6f] process > start_ping:pingMessage (1) [100%] 1 of 1, cached: 1 ✔
[21/191172] process > handleSingleFile (1) [100%] 1 of 1, cached: 1 ✔
[b4/580a85] process > pipeline:summariseConcatRea... [100%] 1 of 1, cached: 1 ✔
[95/f6afab] process > pipeline:getVersions [100%] 1 of 1, cached: 1 ✔
[16/fcdf00] process > pipeline:getParams [100%] 1 of 1, cached: 1 ✔
[10/a260bf] process > pipeline:preprocess_reads (1) [100%] 1 of 1, cached: 1 ✔
[39/57d482] process > pipeline:build_minimap_index [100%] 1 of 1 ✔
[3d/dc14d0] process > pipeline:reference_assembly... [100%] 1 of 1 ✔
[26/7ee945] process > pipeline:split_bam (1) [100%] 1 of 1 ✔
[21/b40a64] process > pipeline:assemble_transcrip... [100%] 1 of 1, failed: 1 ✘
[- ] process > pipeline:merge_gff_bundles -
[- ] process > pipeline:run_gffcompare -
[- ] process > pipeline:makeReport -
[- ] process > pipeline:get_transcriptome -
[57/20be6e] process > output (2) [100%] 2 of 2, cached: 1
[34/34cd82] process > end_ping:pingMessage [100%] 1 of 1, cached: 1 ✔
WARN: Input tuple does not match input set cardinality declared by process `pipeline:makeReport` -- offending value: [[]]
Error executing process > 'pipeline:assemble_transcripts (1)'
Caused by:
Process `pipeline:assemble_transcripts (1)` terminated with an error exit status (1)
Command executed:
stringtie --rf -L -v -A gene_abund.tab -p 4 --conservative -o 000000000_Scq3ke7_8492-HRSCAF=10725_np1212-239056-177420684_bundle_allONT-RNA_PassedReads.gff -l 0 000000000_Scq3ke7_8492-HRSCAF=10725_np1212-239056-177420684_bundle.bam 2>/dev/null
Command exit status:
1
Command output:
(empty)
Work dir:
/scratch/malabady/PitcherGenome/ONT-RNA/output/workspace/21/b40a643f208e5334605d7cf95c1453
Tip: you can replicate the issue by changing to the process work dir and entering the command `bash .command.run
the exit error code is different this time. do you think I need to start from the beginning since the genome file headers are all changed now?
I have cleaned the workspace and a start he run from the beginning with the modified sequence headers. Unfortunately, it failed again with the following error:
executor > local (13)
[6d/27bb6f] process > start_ping:pingMessage (1) [100%] 1 of 1 ✔
[21/191172] process > handleSingleFile (1) [100%] 1 of 1 ✔
[e5/172e35] process > pipeline:summariseConcatRea... [100%] 1 of 1 ✔
[95/f6afab] process > pipeline:getVersions [100%] 1 of 1 ✔
[16/fcdf00] process > pipeline:getParams [100%] 1 of 1 ✔
[34/98ffa1] process > pipeline:preprocess_reads (1) [100%] 1 of 1 ✔
[39/57d482] process > pipeline:build_minimap_index [100%] 1 of 1 ✔
[04/5ca22a] process > pipeline:reference_assembly... [100%] 1 of 1 ✔
[96/97bcaf] process > pipeline:split_bam (1) [100%] 1 of 1 ✔
[08/0cc329] process > pipeline:assemble_transcrip... [100%] 1 of 1, failed: 1 ✘
[- ] process > pipeline:merge_gff_bundles -
[- ] process > pipeline:run_gffcompare -
[- ] process > pipeline:makeReport -
[- ] process > pipeline:get_transcriptome -
[c2/19955b] process > output (2) [100%] 2 of 2
[9b/2d1228] process > end_ping:pingMessage [100%] 1 of 1 ✔
WARN: Input tuple does not match input set cardinality declared by process `pipeline:makeReport` -- offending value: [[]]
Error executing process > 'pipeline:assemble_transcripts (1)'
Caused by:
Process `pipeline:assemble_transcripts (1)` terminated with an error exit status (1)
Command executed:
stringtie --rf -L -v -A gene_abund.tab -p 4 --conservative -o 000000000_Scq3ke7_8492-HRSCAF=10725_np1212-239056-177420684_bundle_allONT-RNA_PassedReads.gff -l 0 000000000_Scq3ke7_8492-HRSCAF=10725_np1212-239056-177420684_bundle.bam 2>/dev/null
Command exit status:
1
Command output:
(empty)
Work dir:
/scratch/malabady/PitcherGenome/ONT-RNA/output2/workspace/08/0cc32931e781f182a6b739a714fe3f
Tip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named `.command.sh`
Any suggestions?
Sorry that you're still having problems with this. Would you be able to send me nextflow.log
, this should be in the directory in which nextflow was run.
Hi Neil,
I am attaching two log files for the two most recent attempts. The first log is for a run using -proflie local
with fasta headers without “;”. When this run failed, I thought to try it using -profile conda
and -resume
. The 2nd log file is for this reattempt. Both failed with the same error when running stringtie
. We have StringTie/2.1.7-GCC-8.3.0
installed as a module, which is loaded before starting this workflow. I tested stringtie
alone and it is working just fine.
Thank you for looking into this issue.
Thanks for this. I should have also asked before, but can you also send me this file please? /scratch/malabady/PitcherGenome/ONT-RNA/output2/workspace/08/0cc32931e781f182a6b739a714fe3f/.commad.err
This particular .command.err
file is empty. but the file 000000000_Scq3ke7_8492-HRSCAF=10725_np1212-239056-177420684_bundle.bam
in the same directory is symbolically linked to a file with the same name in a different directory (/scratch/malabady/PitcherGenome/ONT-RNA/output2/workspace/96/97bcafff6c8c544851d7e3281ec859/000000000_Scq3ke7_8492-HRSCAF=10725_np1212-239056-177420684_bundle.bam
). under this other directory, the `.command.err , .command.log, .command.out, .command.trace' files are not empty. I am attaching all of them here. I also attached the stdout of the job run on the cluster.
command.err.txt command.log command.out.txt' command.trace.txt nflow.9254844.out.txt
I don't see anything informative in the logs. But I think the "=" in the filenames shouldn't be there. Please try removing these for the headers.
This actually worked. the pipeline completed successfully. Many thanks. Now to the actual results.
I am attaching the final html report for you take a look if you don't mind.
Is the correct cDNA kit specified? See the README.
Using the nhmmscan backend in pychopper should improve things a bit, but takes longer.
pychopper_opts = '-m phmm'
The cDNA reads and RNA reads should be run separately. This is because pychopper looks for the SSP and VNP primers that are absent in RNA reads.
If you want to run RNA data set use_pychopper = true
. Unfortunately this option was recently removed. I will get this back into the workflow and released as soon as I can.
I used the PCS109 cDNA kit, which is the default option. so, no need to specify pychopper_opts = '-k PCS109'
. If I understand correctly, all the dRNA reads were tossed away, which explains the high level of "unusable" reads. correct?
where exactly should I set use_pychopper = true
to run RNA dataset? in the command line? is the syntax just like that? sorry I didn't find that part in the readme or help.
That's correct. Your RNA reads likely are the majority of the unclassified set.
During some recent 'tidying up' of the code, I actually removed the use_pychopper
. Apologies for this. I aim to fix that in the next few days.
Thanks a lot for all your help. really appreciate it. I'll look forward for the updated version as I am more interested in the dRNA reads.
Hi Neil,
Understand you are busy. Just curious to know if you were able to implement use_pychopper = true for dRNA dataset ? Just like Malabady, most of my dRNA reads are also labeled as unusable.
Hi @Malabady, @SomaG-PSN There's a new release of the workflow.
nextflow run -r v0.1.2 epi2me-labs/wf-isoforms
For dRNA reads set the following in the config:
direct_rna = true
Please let me know if that doesn't work for you.
Thanks, Neil. I'll run as soon as I hear from you re: the following question. Do I need to covert the RNA (U) to DNA (T) first?
Thanks, Neil. Thanks for this. I will test it. Malabady, I dont think we need to convert RNA to DNA. If I understand it correctly, then direct_rna=true will take care of this.
The direct_rna
option just omits the pychopper step. You do not need to do any conversion on the reads.
Hi Neil,
Thanks again for the update. I tested it and the run completed "successfully". I used the following command:
"nextflow run -r v0.1.2 epi2me-labs/wf-isoforms --fastq
However, I was only able to map 11% of my reads to the reference genome. I am wondering if this is because of the strandedness and wanted to add "-uf" as an option for minimap2. To do so, I made the following change to the command:
"nextflow run -r v0.1.2 epi2me-labs/wf-isoforms --fastq
However, this throws an error message
[+] Loading java 17.0.1 ... [+] Loading nextflow 21.10.5 Unknown option: -uf -- Check the available commands and options and syntax with 'help'
Not sure what is wrong here.
P.S: I checked params.json in the "workspace" directory and it seems that "-uf" is included as a default option. Is that correct. Could you provide any clue why only 11% of the reads are mapping ?
Thanks a lot, Soma
Hi Soma,
The default "-uf" option is correct as we expect the RNA and cDNA reads (after reorientation) to be in the transcript orientation. If you want to set that option at the command line it may need a space like "-uf ".
Would you be able to put a small sample of your data here so I can test it out? https://nanoporetech.box.com/s/kji3udq4fauuk2snjqxe4ai2xq6gktrj
Neil
Hi Neil,
Thanks for the quick response. Adding space after "uf" works fine. However, still only 11% reads are aligning. I was unable to submit my files to the link that you provided, so I made my own link to share with you. Hope that is OK.
https://soma2.box.com/s/3g7ce471h5u2vaka0cfhtitqb40zb1bj
Thanks, Soma
Hi Neil,
I was able to run the new release successfully. As recommended on the minimap2 GitHub page, I passed the following options to minimap2 in the wf: -ax splice -uf
. here is what the minimap2 page says:
minimap2 -ax splice -uf -k14 ref.fa direct-rna.fq > aln.sam # Nanopore Direct RNA-seq
My input number of reads is 3,490,710 passing guppy filters. however, it seems that only 723207 reads made to the mapping step and all of them were mapped (100%), see the following samtools flag stats:
723207 + 0 in total (QC-passed reads + QC-failed reads) 723207 + 0 primary 0 + 0 secondary 0 + 0 supplementary 0 + 0 duplicates 0 + 0 primary duplicates 723207 + 0 mapped (100.00% : N/A)
I am going to rerun the analysis with -k14
in the minimap2 options. Is there anything else that you recommend?
Many thanks,
Hi Neil,
Thanks for the quick response. Adding space after "uf" works fine. However, still only 11% reads are aligning. I was unable to submit my files to the link that you provided, so I made my own link to share with you. Hope that is OK.
https://soma2.box.com/s/3g7ce471h5u2vaka0cfhtitqb40zb1bj
Thanks, Soma
Hi Soma,
A lot of your reads are mapping to multiple genomic locations. These reads are assigned a MAPQ score of 0 during alignment and are filtered out in this workflow. If you set the following minimum_mapping_quality=0
you should get more reads passing. But you may need to do some masking of repetitive regions in your reference.
Hi Neil,
I was able to run the new release successfully. As recommended on the minimap2 GitHub page, I passed the following options to minimap2 in the wf:
-ax splice -uf
. here is what the minimap2 page says:
minimap2 -ax splice -uf -k14 ref.fa direct-rna.fq > aln.sam # Nanopore Direct RNA-seq
My input number of reads is 3,490,710 passing guppy filters. however, it seems that only 723207 reads made to the mapping step and all of them were mapped (100%), see the following samtools flag stats:
723207 + 0 in total (QC-passed reads + QC-failed reads) 723207 + 0 primary 0 + 0 secondary 0 + 0 supplementary 0 + 0 duplicates 0 + 0 primary duplicates 723207 + 0 mapped (100.00% : N/A)
I am going to rerun the analysis with
-k14
in the minimap2 options. Is there anything else that you recommend?Many thanks,
The -k14 option is not required as it this get set with minimap_index_opts = "-k14"
in the default config.
If possible, you send me some data here so I can trouble-shoot: https://nanoporetech.ent.box.com/f/6cc0c561a3c142ec8dc0820aa492b2e5
Hi Neil,
How much data should I transfer to you? I have over 3.5 million fastq reads from thee different run. I have all passed fastq reads from the three runs combined in file, which I have using in the analysis.
I am trying to run the -denovo
pipeline as well. However, it keeps on failing with the following message. I am running it on a local node with 1900 GB RAM. it seems to be an issue with batches being too large. So, I tried to run it with more threads, so that the batches are smaller but it didn't work. Although I give --threads 12
in the command line, it still calculating the batches assuming only 4 threads let batch_size=$nr_bases/1000/4
. shouldn't the batch size be let batch_size=$nr_bases/1000/12
?
I tried changing the the threads number in the nextflow.config file, but it didn't work because of the uncommitted changes.
Any suggestion?
executor > local (1)
[56/aa7565] process > start_ping:pingMessage (1) [100%] 1 of 1, cached: 1 ✔
[09/67e6ae] process > handleSingleFile (1) [100%] 1 of 1, cached: 1 ✔
[a8/766df3] process > pipeline:summariseConcatReads (1) [100%] 1 of 1, cached: 1 ✔
[8a/d72128] process > pipeline:getVersions [100%] 1 of 1, cached: 1 ✔
[69/f896c3] process > pipeline:getParams [100%] 1 of 1, cached: 1 ✔
[54/37742b] process > pipeline:denovo_assembly:make_batches (1) [100%] 1 of 1, failed: 1 ✘
[- ] process > pipeline:denovo_assembly:clustering -
[- ] process > pipeline:denovo_assembly:dump_clusters -
[- ] process > pipeline:denovo_assembly:build_backbones -
[- ] process > pipeline:denovo_assembly:merge_cds -
[- ] process > pipeline:denovo_assembly:cds_align -
[- ] process > pipeline:denovo_assembly:cluster_quality -
[- ] process > pipeline:split_bam -
[- ] process > pipeline:assemble_transcripts -
[- ] process > pipeline:merge_gff_bundles -
[- ] process > pipeline:run_gffcompare -
[- ] process > pipeline:makeReport -
[- ] process > pipeline:get_transcriptome -
[- ] process > output -
[f2/e871b3] process > end_ping:pingMessage [100%] 1 of 1, cached: 1 ✔
WARN: Input tuple does not match input set cardinality declared by process `pipeline:makeReport` -- offending value: [[]]
Error executing process > 'pipeline:denovo_assembly:make_batches (1)'
Caused by:
Process `pipeline:denovo_assembly:make_batches (1)` terminated with an error exit status (134)
Command executed:
b=0
if [ -1 -lt $b ];
then
nr_bases=$(seqkit stats -T dRNA_run1-3_PassedReads.fastq|cut -f 5| sed '2q;d')
let batch_size=$nr_bases/1000/4
if [ $batch_size -lt 2000 ];
then
batch_size=2000
fi
else
batch_size=-1
fi
echo "Batch size:$batch_size";
echo "Num bases: $nr_bases";
init_cls_options="--batch-size $batch_size --kmer-size 11 --window-size 15 --min-shared 5 --min-qual 7.0 --mapped-threshold 0.65 --aligned-threshold 0.2 --min-fraction 0.8 --min-prob-no-hits 0.1 -M -1 -P 500 -g 50 -c -150 -F 2 "
mkdir -p sorted; isONclust2 sort $init_cls_options -v -o sorted dRNA_run1-3_PassedReads.fastq;
Command exit status:
134
Command output:
Batch size:909371
Num bases: 3637487117
Command error:
isONclust2 version: v2.3-e9da596
Batches output directory: sorted
Minimum batch size: 909371 kilobases
Kmer size: 11
Window size: 15
Consensus period: 500
Minimum cluster size for consensus: 50
Maximum cluster size for consensus: -150
Minimum average quality: 7
Minimum shared minimizers: 5
Minimum fraction of top minimizer hit: 0.8
Mapping threshold: 0.65
Alignment threshold: 0.2
Minimum probability no hit: 0.1
Minimum cluster size in left batches: 2
Debug output: off
Warning: reusing existing output directory: sorted
Parsed 3490686 sequences.
Finished sorting sequences.
Sorted sequences written to: sorted/sorted_reads.fastq
Scores written to: sorted/scores.tsv
Preparing batches:
terminate called after throwing an instance of 'std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >'
.command.sh: line 19: 12308 Aborted (core dumped) isONclust2 sort $init_cls_options -v -o sorted dRNA_run1-3_PassedReads.fastq
Hi Malabady,
Sorry for the late reply, I missed this message.
Please send me as many reads as you need to recreate the issue.
Neil
Hi.
I am having troubles getting epi2me-labs/wf-isofom to work with my ONT RNA data. In my mot recent attempt, I am using the following command:
But I am getting the following error:
I am using the following versions of nextflow and wf:
Any suggestions?
Many thanks,