epi2me-labs / wf-isoforms

Other
10 stars 5 forks source link

Error running wf-isofrom: ``command.run: line 281: 000000000_Scq3ke7_8492: command not found`` #6

Open Malabady opened 2 years ago

Malabady commented 2 years ago

Hi.

I am having troubles getting epi2me-labs/wf-isofom to work with my ONT RNA data. In my mot recent attempt, I am using the following command:

OUTPUT=./output;
nextflow run epi2me-labs/wf-isoforms --thread 12  --fastq ./allONT-RNA_PassedReads.fastq  --ref_genome ./genome \
 --minimap2_opts '-uf --splice-flank=no' --out_dir ${OUTPUT} -w ${OUTPUT}/workspace  -profile local -resume

But I am getting the following error:

executor >  local (13)
[6d/27bb6f] process > start_ping:pingMessage (1)     [100%] 1 of 1 ✔
[21/191172] process > handleSingleFile (1)           [100%] 1 of 1 ✔
[b4/580a85] process > pipeline:summariseConcatRea... [100%] 1 of 1 ✔
[95/f6afab] process > pipeline:getVersions           [100%] 1 of 1 ✔
[16/fcdf00] process > pipeline:getParams             [100%] 1 of 1 ✔
[10/a260bf] process > pipeline:preprocess_reads (1)  [100%] 1 of 1 ✔
[dd/c25228] process > pipeline:build_minimap_index   [100%] 1 of 1 ✔
[2b/1ebba1] process > pipeline:reference_assembly... [100%] 1 of 1 ✔
[73/bb6ebe] process > pipeline:split_bam (1)         [100%] 1 of 1 ✔
[e2/e6102f] process > pipeline:assemble_transcrip... [100%] 1 of 1, failed: 1 ✘
[-        ] process > pipeline:merge_gff_bundles     -
[-        ] process > pipeline:run_gffcompare        -
[-        ] process > pipeline:makeReport            -
[-        ] process > pipeline:get_transcriptome     -
[5a/3361e2] process > output (2)                     [100%] 2 of 2
[34/34cd82] process > end_ping:pingMessage           [100%] 1 of 1 ✔
WARN: Input tuple does not match input set cardinality declared by process `pipeline:makeReport` -- offending value: [[]]
Error executing process > 'pipeline:assemble_transcripts (1)'

Caused by:
  Process `pipeline:assemble_transcripts (1)` terminated with an error exit status (127)

Command executed:

  stringtie --rf  -L -v -A gene_abund.tab -p 4  --conservative  -o  000000000_Scq3ke7_8492;HRSCAF=10725_np1212-239056-177420684_bundle_allONT-RNA_PassedReads.gff         -l 0 000000000_Scq3ke7_8492;HRSCAF=10725_np1212-239056-177420684_bundle.bam 2>/dev/null

Command exit status:
  127

Command output:
  (empty)

Command wrapper:
  .command.run: line 281: 000000000_Scq3ke7_8492: command not found

Work dir:
  /scratch/malabady/PitcherGenome/ONT-RNA/output/workspace/e2/e6102f902985ef02bd4202b329afe7

Tip: when you have fixed the problem you can continue the execution adding the option `-resume` to the run command line

I am using the following versions of nextflow and wf:

N E X T F L O W  ~  version 21.10.6
Launching `epi2me-labs/wf-isoforms` [jolly_einstein] - revision: 7dfea725a0 [master]

Any suggestions?

Many thanks,

nrhorner commented 2 years ago

Hi,

Do you have the correct environment setup? Doe sit work with -profile conda?

Malabady commented 2 years ago

I tried -profile Conda first but it didn't work due to missing command mamba. I am working on our computer cluster. Do you think using -profile conda will solve the issue? the thing is, I am not sure what other issues might arise in that case.

nrhorner commented 2 years ago

It's the semicolon in -o 000000000_Scq3ke7_8492;HRSCAF=10725_np1212-239056-177420684_bundle_allONT-RNA_PassedReads.gff

It thinks it's the start of another command. This should quoted out at our end.

nrhorner commented 2 years ago

Can you remove semicolons form the sample_ids/filenames as a workaround for now until I can get a fix released?

Malabady commented 2 years ago

I certainly can. Do I need to restart from the beginning? Can I use “—resume”?

Greetings,

Malabady commented 2 years ago

also, would it cause any problems if I replace the ";" with "|" or "_"?

nrhorner commented 2 years ago

You can use -resume. And yes avoid "|" "/". "-" should be fine

Malabady commented 2 years ago

Many thanks for your quick reply. I have another issue when using the denovo mode. Can I ask it here or better to open a new case so it is searchable or others?

Malabady commented 2 years ago

Sorry, it failed again, see the following:

executor >  local (5)
[6d/27bb6f] process > start_ping:pingMessage (1)     [100%] 1 of 1, cached: 1 ✔
[21/191172] process > handleSingleFile (1)           [100%] 1 of 1, cached: 1 ✔
[b4/580a85] process > pipeline:summariseConcatRea... [100%] 1 of 1, cached: 1 ✔
[95/f6afab] process > pipeline:getVersions           [100%] 1 of 1, cached: 1 ✔
[16/fcdf00] process > pipeline:getParams             [100%] 1 of 1, cached: 1 ✔
[10/a260bf] process > pipeline:preprocess_reads (1)  [100%] 1 of 1, cached: 1 ✔
[39/57d482] process > pipeline:build_minimap_index   [100%] 1 of 1 ✔
[3d/dc14d0] process > pipeline:reference_assembly... [100%] 1 of 1 ✔
[26/7ee945] process > pipeline:split_bam (1)         [100%] 1 of 1 ✔
[21/b40a64] process > pipeline:assemble_transcrip... [100%] 1 of 1, failed: 1 ✘
[-        ] process > pipeline:merge_gff_bundles     -
[-        ] process > pipeline:run_gffcompare        -
[-        ] process > pipeline:makeReport            -
[-        ] process > pipeline:get_transcriptome     -
[57/20be6e] process > output (2)                     [100%] 2 of 2, cached: 1
[34/34cd82] process > end_ping:pingMessage           [100%] 1 of 1, cached: 1 ✔
WARN: Input tuple does not match input set cardinality declared by process `pipeline:makeReport` -- offending value: [[]]
Error executing process > 'pipeline:assemble_transcripts (1)'

Caused by:
  Process `pipeline:assemble_transcripts (1)` terminated with an error exit status (1)

Command executed:

  stringtie --rf  -L -v -A gene_abund.tab -p 4  --conservative  -o  000000000_Scq3ke7_8492-HRSCAF=10725_np1212-239056-177420684_bundle_allONT-RNA_PassedReads.gff         -l 0 000000000_Scq3ke7_8492-HRSCAF=10725_np1212-239056-177420684_bundle.bam 2>/dev/null

Command exit status:
  1

Command output:
  (empty)

Work dir:
  /scratch/malabady/PitcherGenome/ONT-RNA/output/workspace/21/b40a643f208e5334605d7cf95c1453

Tip: you can replicate the issue by changing to the process work dir and entering the command `bash .command.run

the exit error code is different this time. do you think I need to start from the beginning since the genome file headers are all changed now?

Malabady commented 2 years ago

I have cleaned the workspace and a start he run from the beginning with the modified sequence headers. Unfortunately, it failed again with the following error:

executor >  local (13)
[6d/27bb6f] process > start_ping:pingMessage (1)     [100%] 1 of 1 ✔
[21/191172] process > handleSingleFile (1)           [100%] 1 of 1 ✔
[e5/172e35] process > pipeline:summariseConcatRea... [100%] 1 of 1 ✔
[95/f6afab] process > pipeline:getVersions           [100%] 1 of 1 ✔
[16/fcdf00] process > pipeline:getParams             [100%] 1 of 1 ✔
[34/98ffa1] process > pipeline:preprocess_reads (1)  [100%] 1 of 1 ✔
[39/57d482] process > pipeline:build_minimap_index   [100%] 1 of 1 ✔
[04/5ca22a] process > pipeline:reference_assembly... [100%] 1 of 1 ✔
[96/97bcaf] process > pipeline:split_bam (1)         [100%] 1 of 1 ✔
[08/0cc329] process > pipeline:assemble_transcrip... [100%] 1 of 1, failed: 1 ✘
[-        ] process > pipeline:merge_gff_bundles     -
[-        ] process > pipeline:run_gffcompare        -
[-        ] process > pipeline:makeReport            -
[-        ] process > pipeline:get_transcriptome     -
[c2/19955b] process > output (2)                     [100%] 2 of 2
[9b/2d1228] process > end_ping:pingMessage           [100%] 1 of 1 ✔
WARN: Input tuple does not match input set cardinality declared by process `pipeline:makeReport` -- offending value: [[]]
Error executing process > 'pipeline:assemble_transcripts (1)'

Caused by:
  Process `pipeline:assemble_transcripts (1)` terminated with an error exit status (1)

Command executed:

  stringtie --rf  -L -v -A gene_abund.tab -p 4  --conservative  -o  000000000_Scq3ke7_8492-HRSCAF=10725_np1212-239056-177420684_bundle_allONT-RNA_PassedReads.gff         -l 0 000000000_Scq3ke7_8492-HRSCAF=10725_np1212-239056-177420684_bundle.bam 2>/dev/null

Command exit status:
  1

Command output:
  (empty)

Work dir:
  /scratch/malabady/PitcherGenome/ONT-RNA/output2/workspace/08/0cc32931e781f182a6b739a714fe3f

Tip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named `.command.sh`

Any suggestions?

nrhorner commented 2 years ago

Sorry that you're still having problems with this. Would you be able to send me nextflow.log, this should be in the directory in which nextflow was run.

Malabady commented 2 years ago

Hi Neil, I am attaching two log files for the two most recent attempts. The first log is for a run using -proflie local with fasta headers without “;”. When this run failed, I thought to try it using -profile conda and -resume. The 2nd log file is for this reattempt. Both failed with the same error when running stringtie. We have StringTie/2.1.7-GCC-8.3.0 installed as a module, which is loaded before starting this workflow. I tested stringtie alone and it is working just fine.

Thank you for looking into this issue.

nextflow.log nextflow.1.log

nrhorner commented 2 years ago

Thanks for this. I should have also asked before, but can you also send me this file please? /scratch/malabady/PitcherGenome/ONT-RNA/output2/workspace/08/0cc32931e781f182a6b739a714fe3f/.commad.err

Malabady commented 2 years ago

This particular .command.err file is empty. but the file 000000000_Scq3ke7_8492-HRSCAF=10725_np1212-239056-177420684_bundle.bam in the same directory is symbolically linked to a file with the same name in a different directory (/scratch/malabady/PitcherGenome/ONT-RNA/output2/workspace/96/97bcafff6c8c544851d7e3281ec859/000000000_Scq3ke7_8492-HRSCAF=10725_np1212-239056-177420684_bundle.bam). under this other directory, the `.command.err , .command.log, .command.out, .command.trace' files are not empty. I am attaching all of them here. I also attached the stdout of the job run on the cluster.

command.err.txt command.log command.out.txt' command.trace.txt nflow.9254844.out.txt

nrhorner commented 2 years ago

I don't see anything informative in the logs. But I think the "=" in the filenames shouldn't be there. Please try removing these for the headers.

Malabady commented 2 years ago

This actually worked. the pipeline completed successfully. Many thanks. Now to the actual results.

I am attaching the final html report for you take a look if you don't mind.

wf-isoforms-report.html.txt

nrhorner commented 2 years ago

Is the correct cDNA kit specified? See the README.

Using the nhmmscan backend in pychopper should improve things a bit, but takes longer. pychopper_opts = '-m phmm'

The cDNA reads and RNA reads should be run separately. This is because pychopper looks for the SSP and VNP primers that are absent in RNA reads.

If you want to run RNA data set use_pychopper = true. Unfortunately this option was recently removed. I will get this back into the workflow and released as soon as I can.

Malabady commented 2 years ago

I used the PCS109 cDNA kit, which is the default option. so, no need to specify pychopper_opts = '-k PCS109'. If I understand correctly, all the dRNA reads were tossed away, which explains the high level of "unusable" reads. correct?

where exactly should I set use_pychopper = true to run RNA dataset? in the command line? is the syntax just like that? sorry I didn't find that part in the readme or help.

nrhorner commented 2 years ago

That's correct. Your RNA reads likely are the majority of the unclassified set.

During some recent 'tidying up' of the code, I actually removed the use_pychopper. Apologies for this. I aim to fix that in the next few days.

Malabady commented 2 years ago

Thanks a lot for all your help. really appreciate it. I'll look forward for the updated version as I am more interested in the dRNA reads.

SomaG-PSN commented 2 years ago

Hi Neil,

Understand you are busy. Just curious to know if you were able to implement use_pychopper = true for dRNA dataset ? Just like Malabady, most of my dRNA reads are also labeled as unusable.

nrhorner commented 2 years ago

Hi @Malabady, @SomaG-PSN There's a new release of the workflow.

nextflow run -r v0.1.2 epi2me-labs/wf-isoforms

For dRNA reads set the following in the config: direct_rna = true

Please let me know if that doesn't work for you.

Malabady commented 2 years ago

Thanks, Neil. I'll run as soon as I hear from you re: the following question. Do I need to covert the RNA (U) to DNA (T) first?

SomaG-PSN commented 2 years ago

Thanks, Neil. Thanks for this. I will test it. Malabady, I dont think we need to convert RNA to DNA. If I understand it correctly, then direct_rna=true will take care of this.

nrhorner commented 2 years ago

The direct_rna option just omits the pychopper step. You do not need to do any conversion on the reads.

SomaG-PSN commented 2 years ago

Hi Neil,

Thanks again for the update. I tested it and the run completed "successfully". I used the following command:

"nextflow run -r v0.1.2 epi2me-labs/wf-isoforms --fastq --ref_genome --direct_rna true --ref_annotation -profile conda --out_dir ./output_rb -w ./output_rb/workspace --sample test"

However, I was only able to map 11% of my reads to the reference genome. I am wondering if this is because of the strandedness and wanted to add "-uf" as an option for minimap2. To do so, I made the following change to the command:

"nextflow run -r v0.1.2 epi2me-labs/wf-isoforms --fastq --ref_genome --direct_rna true --ref_annotation --minimap2_opts '-uf' -profile conda --out_dir ./output_rb -w ./output_rb/workspace --sample test".

However, this throws an error message

[+] Loading java 17.0.1 ... [+] Loading nextflow 21.10.5 Unknown option: -uf -- Check the available commands and options and syntax with 'help'

Not sure what is wrong here.

P.S: I checked params.json in the "workspace" directory and it seems that "-uf" is included as a default option. Is that correct. Could you provide any clue why only 11% of the reads are mapping ?

Thanks a lot, Soma

nrhorner commented 2 years ago

Hi Soma,

The default "-uf" option is correct as we expect the RNA and cDNA reads (after reorientation) to be in the transcript orientation. If you want to set that option at the command line it may need a space like "-uf ".

Would you be able to put a small sample of your data here so I can test it out? https://nanoporetech.box.com/s/kji3udq4fauuk2snjqxe4ai2xq6gktrj

Neil

SomaG-PSN commented 2 years ago

Hi Neil,

Thanks for the quick response. Adding space after "uf" works fine. However, still only 11% reads are aligning. I was unable to submit my files to the link that you provided, so I made my own link to share with you. Hope that is OK.

https://soma2.box.com/s/3g7ce471h5u2vaka0cfhtitqb40zb1bj

Thanks, Soma

Malabady commented 2 years ago

Hi Neil,

I was able to run the new release successfully. As recommended on the minimap2 GitHub page, I passed the following options to minimap2 in the wf: -ax splice -uf. here is what the minimap2 page says:

minimap2 -ax splice -uf -k14 ref.fa direct-rna.fq > aln.sam # Nanopore Direct RNA-seq

My input number of reads is 3,490,710 passing guppy filters. however, it seems that only 723207 reads made to the mapping step and all of them were mapped (100%), see the following samtools flag stats:

723207 + 0 in total (QC-passed reads + QC-failed reads) 723207 + 0 primary 0 + 0 secondary 0 + 0 supplementary 0 + 0 duplicates 0 + 0 primary duplicates 723207 + 0 mapped (100.00% : N/A)

I am going to rerun the analysis with -k14 in the minimap2 options. Is there anything else that you recommend?

Many thanks,

nrhorner commented 2 years ago

Hi Neil,

Thanks for the quick response. Adding space after "uf" works fine. However, still only 11% reads are aligning. I was unable to submit my files to the link that you provided, so I made my own link to share with you. Hope that is OK.

https://soma2.box.com/s/3g7ce471h5u2vaka0cfhtitqb40zb1bj

Thanks, Soma

Hi Soma,

A lot of your reads are mapping to multiple genomic locations. These reads are assigned a MAPQ score of 0 during alignment and are filtered out in this workflow. If you set the following minimum_mapping_quality=0 you should get more reads passing. But you may need to do some masking of repetitive regions in your reference.

nrhorner commented 2 years ago

Hi Neil,

I was able to run the new release successfully. As recommended on the minimap2 GitHub page, I passed the following options to minimap2 in the wf: -ax splice -uf. here is what the minimap2 page says:

minimap2 -ax splice -uf -k14 ref.fa direct-rna.fq > aln.sam # Nanopore Direct RNA-seq

My input number of reads is 3,490,710 passing guppy filters. however, it seems that only 723207 reads made to the mapping step and all of them were mapped (100%), see the following samtools flag stats:

723207 + 0 in total (QC-passed reads + QC-failed reads) 723207 + 0 primary 0 + 0 secondary 0 + 0 supplementary 0 + 0 duplicates 0 + 0 primary duplicates 723207 + 0 mapped (100.00% : N/A)

I am going to rerun the analysis with -k14 in the minimap2 options. Is there anything else that you recommend?

Many thanks,

The -k14 option is not required as it this get set with minimap_index_opts = "-k14" in the default config.

If possible, you send me some data here so I can trouble-shoot: https://nanoporetech.ent.box.com/f/6cc0c561a3c142ec8dc0820aa492b2e5

Malabady commented 2 years ago

Hi Neil,

How much data should I transfer to you? I have over 3.5 million fastq reads from thee different run. I have all passed fastq reads from the three runs combined in file, which I have using in the analysis.

I am trying to run the -denovo pipeline as well. However, it keeps on failing with the following message. I am running it on a local node with 1900 GB RAM. it seems to be an issue with batches being too large. So, I tried to run it with more threads, so that the batches are smaller but it didn't work. Although I give --threads 12 in the command line, it still calculating the batches assuming only 4 threads let batch_size=$nr_bases/1000/4. shouldn't the batch size be let batch_size=$nr_bases/1000/12? I tried changing the the threads number in the nextflow.config file, but it didn't work because of the uncommitted changes.

Any suggestion?


executor >  local (1)
[56/aa7565] process > start_ping:pingMessage (1)                [100%] 1 of 1, cached: 1 ✔
[09/67e6ae] process > handleSingleFile (1)                      [100%] 1 of 1, cached: 1 ✔
[a8/766df3] process > pipeline:summariseConcatReads (1)         [100%] 1 of 1, cached: 1 ✔
[8a/d72128] process > pipeline:getVersions                      [100%] 1 of 1, cached: 1 ✔
[69/f896c3] process > pipeline:getParams                        [100%] 1 of 1, cached: 1 ✔
[54/37742b] process > pipeline:denovo_assembly:make_batches (1) [100%] 1 of 1, failed: 1 ✘
[-        ] process > pipeline:denovo_assembly:clustering       -
[-        ] process > pipeline:denovo_assembly:dump_clusters    -
[-        ] process > pipeline:denovo_assembly:build_backbones  -
[-        ] process > pipeline:denovo_assembly:merge_cds        -
[-        ] process > pipeline:denovo_assembly:cds_align        -
[-        ] process > pipeline:denovo_assembly:cluster_quality  -
[-        ] process > pipeline:split_bam                        -
[-        ] process > pipeline:assemble_transcripts             -
[-        ] process > pipeline:merge_gff_bundles                -
[-        ] process > pipeline:run_gffcompare                   -
[-        ] process > pipeline:makeReport                       -
[-        ] process > pipeline:get_transcriptome                -
[-        ] process > output                                    -
[f2/e871b3] process > end_ping:pingMessage                      [100%] 1 of 1, cached: 1 ✔
WARN: Input tuple does not match input set cardinality declared by process `pipeline:makeReport` -- offending value: [[]]
Error executing process > 'pipeline:denovo_assembly:make_batches (1)'

Caused by:
  Process `pipeline:denovo_assembly:make_batches (1)` terminated with an error exit status (134)

Command executed:

  b=0
  if [ -1 -lt $b ];
  then
      nr_bases=$(seqkit stats -T dRNA_run1-3_PassedReads.fastq|cut -f 5| sed '2q;d')
      let batch_size=$nr_bases/1000/4
      if [ $batch_size -lt 2000 ];
          then
              batch_size=2000
      fi
  else
      batch_size=-1
  fi

  echo "Batch size:$batch_size";
  echo "Num bases: $nr_bases";

  init_cls_options="--batch-size $batch_size --kmer-size 11 --window-size 15         --min-shared 5 --min-qual 7.0         --mapped-threshold 0.65 --aligned-threshold 0.2         --min-fraction 0.8 --min-prob-no-hits 0.1         -M -1 -P 500 -g 50         -c -150 -F 2 "
  mkdir -p sorted; isONclust2 sort $init_cls_options -v -o sorted dRNA_run1-3_PassedReads.fastq;

Command exit status:
  134

Command output:
  Batch size:909371
  Num bases: 3637487117

Command error:
  isONclust2 version: v2.3-e9da596
  Batches output directory: sorted
  Minimum batch size: 909371 kilobases
  Kmer size: 11
  Window size: 15
  Consensus period: 500
  Minimum cluster size for consensus: 50
  Maximum cluster size for consensus: -150
  Minimum average quality: 7
  Minimum shared minimizers: 5
  Minimum fraction of top minimizer hit: 0.8
  Mapping threshold: 0.65
  Alignment threshold: 0.2
  Minimum probability no hit: 0.1
  Minimum cluster size in left batches: 2
  Debug output: off
  Warning: reusing existing output directory: sorted
  Parsed 3490686 sequences.
  Finished sorting sequences.
  Sorted sequences written to: sorted/sorted_reads.fastq
  Scores written to: sorted/scores.tsv
  Preparing batches:
  terminate called after throwing an instance of 'std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >'
  .command.sh: line 19: 12308 Aborted                 (core dumped) isONclust2 sort $init_cls_options -v -o sorted dRNA_run1-3_PassedReads.fastq
nrhorner commented 2 years ago

Hi Malabady,

Sorry for the late reply, I missed this message.

Please send me as many reads as you need to recreate the issue.

Neil