bcbio / bcbio-nextgen

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis
https://bcbio-nextgen.readthedocs.io
MIT License

microRNAseq analysis using bcbio for non model organisms #2427

Closed WimSpee closed 5 years ago

WimSpee commented 6 years ago

Hi,

Do you expect that the microRNAseq analysis provided by bcbio makes sense for microRNAseq data from non-model organisms?

I am trying to see if I can process the Capsicum annuum microRNAseq data generated in this project using bcbio: https://www.ncbi.nlm.nih.gov/bioproject/PRJNA177852

I am new to microRNAseq analysis, so I am not sure how to run it or what kind of output I should expect.

The following is the yaml file that I am using:

upload:
  dir: ../final
details:
  - analysis: smallRNA-seq
    algorithm:
      aligner: star # any other aligner is supported.
      # change adapter according project
      # adapters: ["TGGAATTCTCGGGTGC"]
      expression_caller: [ seqcluster, mirdeep2]
      # expression_caller: [trna, seqcluster, mirdeep2, mirge] Read docs to know how to use
      # miRge tools: https://bcbio-nextgen.readthedocs.io/en/latest/contents/pipelines.html#smallrna-seq
      # species: hsa
    genome_build: my_ref
#resources:
#  atropos:
#    options: ["-u 4", "-u -4"]
#  mirge:
#    options: ["-lib $PATH_TO_LIBS_FOLDER"]

This is the log file produced by the analysis.

[2018-06-27T18:55Z] grid_controller: System YAML configuration: /workspace/my_user/tmp_bcbio_1.1.0_development/data_dir/galaxy/bcbio_system.yaml
[2018-06-27T18:56Z] grid_controller: Timing: organize samples
[2018-06-27T18:56Z] grid_controller: ipython: organize_samples
[2018-06-27T18:56Z] exeuction_node_20: Using input YAML configuration: /leading_dir/config/DA_1164_samples-merged.yaml
[2018-06-27T18:56Z] exeuction_node_20: Checking sample YAML configuration: /leading_dir/config/DA_1164_samples-merged.yaml
[2018-06-27T18:56Z] exeuction_node_20: Testing minimum versions of installed programs
[2018-06-27T18:56Z] grid_controller: ipython: prepare_sample
[2018-06-27T18:56Z] grid_controller: Timing: adapter trimming
[2018-06-27T18:56Z] grid_controller: ipython: trim_srna_sample
[2018-06-27T19:26Z] exeuction_node_20: Collapsing /leading_dir/work/trimmed/DA_1164_01/DA_1164_01.clean.fastq.gz with --min_size 16 --min 1
[2018-06-27T20:17Z] exeuction_node_20: Collapsing /leading_dir/work/trimmed/DA_1164_02/DA_1164_02.clean.fastq.gz with --min_size 16 --min 1
[2018-06-27T21:00Z] exeuction_node_20: Collapsing /leading_dir/work/trimmed/DA_1164_03/DA_1164_03.clean.fastq.gz with --min_size 16 --min 1
[2018-06-27T21:42Z] exeuction_node_20: Collapsing /leading_dir/work/trimmed/DA_1164_04/DA_1164_04.clean.fastq.gz with --min_size 16 --min 1
[2018-06-27T22:20Z] exeuction_node_20: Collapsing /leading_dir/work/trimmed/DA_1164_05/DA_1164_05.clean.fastq.gz with --min_size 16 --min 1
[2018-06-27T22:55Z] exeuction_node_20: Collapsing /leading_dir/work/trimmed/DA_1164_06/DA_1164_06.clean.fastq.gz with --min_size 16 --min 1
[2018-06-27T23:31Z] exeuction_node_20: Collapsing /leading_dir/work/trimmed/DA_1164_07/DA_1164_07.clean.fastq.gz with --min_size 16 --min 1
[2018-06-28T00:28Z] exeuction_node_20: Collapsing /leading_dir/work/trimmed/DA_1164_08/DA_1164_08.clean.fastq.gz with --min_size 16 --min 1
[2018-06-28T01:11Z] exeuction_node_20: Collapsing /leading_dir/work/trimmed/DA_1164_09/DA_1164_09.clean.fastq.gz with --min_size 16 --min 1
[2018-06-28T01:54Z] exeuction_node_20: Collapsing /leading_dir/work/trimmed/DA_1164_10/DA_1164_10.clean.fastq.gz with --min_size 16 --min 1
[2018-06-28T02:13Z] grid_controller: Timing: prepare
[2018-06-28T02:13Z] grid_controller: ipython: seqcluster_prepare
[2018-06-28T03:05Z] exeuction_node_24: Prepare seqs.fastq with -minl 17 -maxl 40 -minc 2 --min_shared 0.1
[2018-06-28T03:08Z] grid_controller: Timing: alignment
[2018-06-28T03:08Z] grid_controller: ipython: srna_alignment
[2018-06-28T03:08Z] exeuction_node_24: Aligning lane DA_1164_01 with star aligner
[2018-06-28T03:11Z] exeuction_node_24: mirdeep2 Rfam file not instaled. Skipping...
[2018-06-28T03:11Z] grid_controller: Timing: small RNA annotation
[2018-06-28T03:11Z] grid_controller: ipython: srna_annotation
[2018-06-28T03:12Z] grid_controller: Timing: cluster
[2018-06-28T03:12Z] grid_controller: ipython: seqcluster_cluster
[2018-06-28T04:59Z] grid_controller: Timing: quality control
[2018-06-28T04:59Z] grid_controller: ipython: pipeline_summary
[2018-06-28T04:59Z] exeuction_node_20: QC: DA_1164_01 fastqc

I am not sure how to specify that dnapi should be run for de novo adapter detection followed by adapter trimming. As far as I can tell, dnapi was not used for adapter trimming. The FastQC part of the MultiQC report shows that the last 25 bp of the 50 bp reads are almost 100% adapter sequence.

As far as I can tell, Capsicum annuum is not in mirbase, so I did not enter a three-letter species code. I am not sure whether it makes sense to enter the species code of a somewhat related species (http://www.mirbase.org/cgi-bin/mirna_summary.pl?org=sly), or whether it is better not to provide a species code at all.

The analysis did not seem to produce many results. See the file list at the bottom of this comment. Then again, I am also not sure what to expect.

The lack of output might in part be because the mirdeep2 Rfam file was not installed or not found. Should I have installed it myself?

[2018-06-28T03:08Z] exeuction_node_24: Aligning lane DA_1164_01 with star aligner
[2018-06-28T03:11Z] exeuction_node_24: mirdeep2 Rfam file not instaled. Skipping...

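What I more or less expect as output of a microRNAseq analysis is:

- identification/filtering of known/discovered non-microRNA sequences (either biological, e.g. other RNAs, or adapters)
- identification of known microRNA sequences from mirbase or similar
- per-sample alignment BAM files of the microRNA sequences (not sure if this should run against the genome or the transcriptome, or both; I am also not sure if these alignments identify target loci/mRNAs or microRNA precursor loci/mRNAs, or both)
- microRNA target mRNA/gene prediction
- microRNA quantification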

Do you think it is possible to get the above results using bcbio for microRNAseq data of a non model organism? How would I then do that using bcbio? Is the yaml that I use correct? Should I add tRNA as an expression caller?

Since I am new to microRNAseq, the bcbio microRNAseq documentation is also a bit short for me. I would very much appreciate it if you could point me to a recent best-practice or review paper that describes the method(s) that bcbio tries to provide for microRNAseq analysis.

Thank you very much!

final/
final/DA_1164_05
final/DA_1164_05/qc
final/DA_1164_05/qc/fastqc
final/DA_1164_05/qc/fastqc/fastqc_report.html
final/DA_1164_05/qc/fastqc/fastqc_data.txt
final/DA_1164_05/qc/fastqc/DA_1164_05.zip
final/DA_1164_05/qc/fastqc/Per_base_sequence_quality.tsv
final/DA_1164_05/qc/fastqc/Per_tile_sequence_quality.tsv
final/DA_1164_05/qc/fastqc/Per_sequence_quality_scores.tsv
final/DA_1164_05/qc/fastqc/Per_base_sequence_content.tsv
final/DA_1164_05/qc/fastqc/Per_sequence_GC_content.tsv
final/DA_1164_05/qc/fastqc/Per_base_N_content.tsv
final/DA_1164_05/qc/fastqc/Sequence_Length_Distribution.tsv
final/DA_1164_05/DA_1164_05-ready.trimming_stats
final/DA_1164_04
final/DA_1164_04/qc
final/DA_1164_04/qc/fastqc
final/DA_1164_04/qc/fastqc/fastqc_report.html
final/DA_1164_04/qc/fastqc/fastqc_data.txt
final/DA_1164_04/qc/fastqc/DA_1164_04.zip
final/DA_1164_04/qc/fastqc/Per_base_sequence_quality.tsv
final/DA_1164_04/qc/fastqc/Per_tile_sequence_quality.tsv
final/DA_1164_04/qc/fastqc/Per_sequence_quality_scores.tsv
final/DA_1164_04/qc/fastqc/Per_base_sequence_content.tsv
final/DA_1164_04/qc/fastqc/Per_sequence_GC_content.tsv
final/DA_1164_04/qc/fastqc/Per_base_N_content.tsv
final/DA_1164_04/qc/fastqc/Sequence_Length_Distribution.tsv
final/DA_1164_04/DA_1164_04-ready.trimming_stats
final/DA_1164_09
final/DA_1164_09/qc
final/DA_1164_09/qc/fastqc
final/DA_1164_09/qc/fastqc/fastqc_report.html
final/DA_1164_09/qc/fastqc/fastqc_data.txt
final/DA_1164_09/qc/fastqc/DA_1164_09.zip
final/DA_1164_09/qc/fastqc/Per_base_sequence_quality.tsv
final/DA_1164_09/qc/fastqc/Per_tile_sequence_quality.tsv
final/DA_1164_09/qc/fastqc/Per_sequence_quality_scores.tsv
final/DA_1164_09/qc/fastqc/Per_base_sequence_content.tsv
final/DA_1164_09/qc/fastqc/Per_sequence_GC_content.tsv
final/DA_1164_09/qc/fastqc/Per_base_N_content.tsv
final/DA_1164_09/qc/fastqc/Sequence_Length_Distribution.tsv
final/DA_1164_09/DA_1164_09-ready.trimming_stats
final/DA_1164_08
final/DA_1164_08/qc
final/DA_1164_08/qc/fastqc
final/DA_1164_08/qc/fastqc/fastqc_report.html
final/DA_1164_08/qc/fastqc/fastqc_data.txt
final/DA_1164_08/qc/fastqc/DA_1164_08.zip
final/DA_1164_08/qc/fastqc/Per_base_sequence_quality.tsv
final/DA_1164_08/qc/fastqc/Per_tile_sequence_quality.tsv
final/DA_1164_08/qc/fastqc/Per_sequence_quality_scores.tsv
final/DA_1164_08/qc/fastqc/Per_base_sequence_content.tsv
final/DA_1164_08/qc/fastqc/Per_sequence_GC_content.tsv
final/DA_1164_08/qc/fastqc/Per_base_N_content.tsv
final/DA_1164_08/qc/fastqc/Sequence_Length_Distribution.tsv
final/DA_1164_08/DA_1164_08-ready.trimming_stats
final/DA_1164_07
final/DA_1164_07/qc
final/DA_1164_07/qc/fastqc
final/DA_1164_07/qc/fastqc/fastqc_report.html
final/DA_1164_07/qc/fastqc/fastqc_data.txt
final/DA_1164_07/qc/fastqc/DA_1164_07.zip
final/DA_1164_07/qc/fastqc/Per_base_sequence_quality.tsv
final/DA_1164_07/qc/fastqc/Per_tile_sequence_quality.tsv
final/DA_1164_07/qc/fastqc/Per_sequence_quality_scores.tsv
final/DA_1164_07/qc/fastqc/Per_base_sequence_content.tsv
final/DA_1164_07/qc/fastqc/Per_sequence_GC_content.tsv
final/DA_1164_07/qc/fastqc/Per_base_N_content.tsv
final/DA_1164_07/qc/fastqc/Sequence_Length_Distribution.tsv
final/DA_1164_07/DA_1164_07-ready.trimming_stats
final/DA_1164_06
final/DA_1164_06/qc
final/DA_1164_06/qc/fastqc
final/DA_1164_06/qc/fastqc/fastqc_report.html
final/DA_1164_06/qc/fastqc/fastqc_data.txt
final/DA_1164_06/qc/fastqc/DA_1164_06.zip
final/DA_1164_06/qc/fastqc/Per_base_sequence_quality.tsv
final/DA_1164_06/qc/fastqc/Per_tile_sequence_quality.tsv
final/DA_1164_06/qc/fastqc/Per_sequence_quality_scores.tsv
final/DA_1164_06/qc/fastqc/Per_base_sequence_content.tsv
final/DA_1164_06/qc/fastqc/Per_sequence_GC_content.tsv
final/DA_1164_06/qc/fastqc/Per_base_N_content.tsv
final/DA_1164_06/qc/fastqc/Sequence_Length_Distribution.tsv
final/DA_1164_06/DA_1164_06-ready.trimming_stats
final/DA_1164_01
final/DA_1164_01/qc
final/DA_1164_01/qc/fastqc
final/DA_1164_01/qc/fastqc/fastqc_report.html
final/DA_1164_01/qc/fastqc/fastqc_data.txt
final/DA_1164_01/qc/fastqc/DA_1164_01.zip
final/DA_1164_01/qc/fastqc/Per_base_sequence_quality.tsv
final/DA_1164_01/qc/fastqc/Per_tile_sequence_quality.tsv
final/DA_1164_01/qc/fastqc/Per_sequence_quality_scores.tsv
final/DA_1164_01/qc/fastqc/Per_base_sequence_content.tsv
final/DA_1164_01/qc/fastqc/Per_sequence_GC_content.tsv
final/DA_1164_01/qc/fastqc/Per_base_N_content.tsv
final/DA_1164_01/qc/fastqc/Sequence_Length_Distribution.tsv
final/DA_1164_01/qc/small-rna
final/DA_1164_01/qc/small-rna/DA_1164_01.txt
final/DA_1164_01/DA_1164_01-ready.trimming_stats
final/DA_1164_03
final/DA_1164_03/qc
final/DA_1164_03/qc/fastqc
final/DA_1164_03/qc/fastqc/fastqc_report.html
final/DA_1164_03/qc/fastqc/fastqc_data.txt
final/DA_1164_03/qc/fastqc/DA_1164_03.zip
final/DA_1164_03/qc/fastqc/Per_base_sequence_quality.tsv
final/DA_1164_03/qc/fastqc/Per_tile_sequence_quality.tsv
final/DA_1164_03/qc/fastqc/Per_sequence_quality_scores.tsv
final/DA_1164_03/qc/fastqc/Per_base_sequence_content.tsv
final/DA_1164_03/qc/fastqc/Per_sequence_GC_content.tsv
final/DA_1164_03/qc/fastqc/Per_base_N_content.tsv
final/DA_1164_03/qc/fastqc/Sequence_Length_Distribution.tsv
final/DA_1164_03/DA_1164_03-ready.trimming_stats
final/DA_1164_02
final/DA_1164_02/qc
final/DA_1164_02/qc/fastqc
final/DA_1164_02/qc/fastqc/fastqc_report.html
final/DA_1164_02/qc/fastqc/fastqc_data.txt
final/DA_1164_02/qc/fastqc/DA_1164_02.zip
final/DA_1164_02/qc/fastqc/Per_base_sequence_quality.tsv
final/DA_1164_02/qc/fastqc/Per_tile_sequence_quality.tsv
final/DA_1164_02/qc/fastqc/Per_sequence_quality_scores.tsv
final/DA_1164_02/qc/fastqc/Per_base_sequence_content.tsv
final/DA_1164_02/qc/fastqc/Per_sequence_GC_content.tsv
final/DA_1164_02/qc/fastqc/Per_base_N_content.tsv
final/DA_1164_02/qc/fastqc/Sequence_Length_Distribution.tsv
final/DA_1164_02/DA_1164_02-ready.trimming_stats
final/DA_1164_10
final/DA_1164_10/qc
final/DA_1164_10/qc/fastqc
final/DA_1164_10/qc/fastqc/fastqc_report.html
final/DA_1164_10/qc/fastqc/fastqc_data.txt
final/DA_1164_10/qc/fastqc/DA_1164_10.zip
final/DA_1164_10/qc/fastqc/Per_base_sequence_quality.tsv
final/DA_1164_10/qc/fastqc/Per_tile_sequence_quality.tsv
final/DA_1164_10/qc/fastqc/Per_sequence_quality_scores.tsv
final/DA_1164_10/qc/fastqc/Per_base_sequence_content.tsv
final/DA_1164_10/qc/fastqc/Per_sequence_GC_content.tsv
final/DA_1164_10/qc/fastqc/Per_base_N_content.tsv
final/DA_1164_10/qc/fastqc/Sequence_Length_Distribution.tsv
final/DA_1164_10/DA_1164_10-ready.trimming_stats
final/2018-06-28_DA_1164_samples-merged
final/2018-06-28_DA_1164_samples-merged/programs.txt
final/2018-06-28_DA_1164_samples-merged/bcbio-nextgen.log
final/2018-06-28_DA_1164_samples-merged/bcbio-nextgen-commands.log
final/2018-06-28_DA_1164_samples-merged/project-summary.yaml
final/2018-06-28_DA_1164_samples-merged/report
final/2018-06-28_DA_1164_samples-merged/report/srna_report.rmd
final/2018-06-28_DA_1164_samples-merged/report/summary.csv
final/2018-06-28_DA_1164_samples-merged/multiqc
final/2018-06-28_DA_1164_samples-merged/multiqc/multiqc_report.html
final/2018-06-28_DA_1164_samples-merged/multiqc/report
final/2018-06-28_DA_1164_samples-merged/multiqc/report/metrics
final/2018-06-28_DA_1164_samples-merged/multiqc/report/metrics/DA_1164_08_bcbio.txt
final/2018-06-28_DA_1164_samples-merged/multiqc/report/metrics/DA_1164_04_bcbio.txt
final/2018-06-28_DA_1164_samples-merged/multiqc/report/metrics/DA_1164_07_bcbio.txt
final/2018-06-28_DA_1164_samples-merged/multiqc/report/metrics/DA_1164_06_bcbio.txt
final/2018-06-28_DA_1164_samples-merged/multiqc/report/metrics/DA_1164_02_bcbio.txt
final/2018-06-28_DA_1164_samples-merged/multiqc/report/metrics/DA_1164_05_bcbio.txt
final/2018-06-28_DA_1164_samples-merged/multiqc/report/metrics/DA_1164_10_bcbio.txt
final/2018-06-28_DA_1164_samples-merged/multiqc/report/metrics/DA_1164_01_bcbio.txt
final/2018-06-28_DA_1164_samples-merged/multiqc/report/metrics/DA_1164_09_bcbio.txt
final/2018-06-28_DA_1164_samples-merged/multiqc/report/metrics/DA_1164_03_bcbio.txt
final/2018-06-28_DA_1164_samples-merged/multiqc/multiqc_config.yaml
final/2018-06-28_DA_1164_samples-merged/multiqc/multiqc_data
final/2018-06-28_DA_1164_samples-merged/multiqc/multiqc_data/multiqc_data_final.json
final/2018-06-28_DA_1164_samples-merged/multiqc/list_files_final.txt
final/2018-06-28_DA_1164_samples-merged/seqcluster
final/2018-06-28_DA_1164_samples-merged/seqcluster/log
final/2018-06-28_DA_1164_samples-merged/seqcluster/log/run.log
final/2018-06-28_DA_1164_samples-merged/seqcluster/log/trace.log
final/2018-06-28_DA_1164_samples-merged/seqcluster/seqs_rmlw.bam_cov.tsv
final/2018-06-28_DA_1164_samples-merged/seqcluster/read_stats.tsv
final/2018-06-28_DA_1164_samples-merged/seqcluster/cluster.bed
final/2018-06-28_DA_1164_samples-merged/seqcluster/list_obj.pk
final/2018-06-28_DA_1164_samples-merged/seqcluster/list_obj_red.pk
final/2018-06-28_DA_1164_samples-merged/seqcluster/counts.tsv
final/2018-06-28_DA_1164_samples-merged/seqcluster/size_counts.tsv
final/2018-06-28_DA_1164_samples-merged/seqcluster/positions.bed
final/2018-06-28_DA_1164_samples-merged/seqcluster/counts_sequence.tsv
final/2018-06-28_DA_1164_samples-merged/seqcluster/seqcluster.json
final/2018-06-28_DA_1164_samples-merged/seqclusterViz
final/2018-06-28_DA_1164_samples-merged/seqclusterViz/log
final/2018-06-28_DA_1164_samples-merged/seqclusterViz/log/run.log
final/2018-06-28_DA_1164_samples-merged/seqclusterViz/log/trace.log
final/2018-06-28_DA_1164_samples-merged/seqclusterViz/profiles
final/2018-06-28_DA_1164_samples-merged/seqclusterViz/profiles/344
final/2018-06-28_DA_1164_samples-merged/seqclusterViz/profiles/5
final/2018-06-28_DA_1164_samples-merged/seqclusterViz/seqcluster.db
lpantano commented 6 years ago

Hi,

Thanks for the questions.

Sadly, for non-model organisms, the only analysis that will run is seqcluster, which generates small RNA loci over the genome and their expression (seqcluster/counts.tsv), which you can use with DESeq2. You can also visualize this data with https://github.com/lpantano/seqclusterViz: you’ll need to download the repo, open the index.html file and load the seqclusterViz/seqcluster.db file.

Mirdeep won’t work on plants; for that we’ll need a plant prediction tool. I know of one, but it is not integrated. I can work on that, but it will take a month or so. Once the trimming works, mirdeep may run successfully and give you some predictions, but I am not sure I would trust them much.

To make the trimming happen, you need to set trim_reads: True under algorithm in the yaml file.
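
A minimal sketch of what that change could look like, reusing the yaml posted at the top of this thread. Only the trim_reads line is new. The adapters line is left commented out as in the original; whether the adapter then has to be given explicitly or is detected automatically is not confirmed in this thread, so treat that part as an assumption.

```yaml
upload:
  dir: ../final
details:
  - analysis: smallRNA-seq
    algorithm:
      aligner: star
      # new line suggested in this thread: turn on adapter trimming
      trim_reads: True
      # left commented out as in the original yaml; setting the adapter
      # explicitly is an assumption, not something confirmed in this thread
      # adapters: ["TGGAATTCTCGGGTGC"]
      expression_caller: [seqcluster, mirdeep2]
    genome_build: my_ref
```

After rerunning, the FastQC section of the MultiQC report should show whether the adapter content at the 3' end of the reads has gone down.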

If you know there is a similar species in mirbase, let me know and I can help with setting up the files for that.

Thanks for trying bcbio. I am happy to help get the trimming working, and the annotation with mirbase if you know of a similar species.

Cheers


WimSpee commented 6 years ago

Hi Lorena Pantano.

Thank you for the information. I did not know that plant-specific tools were needed. Do you mean either of these two tools?

miRPlant: https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-15-275

miRDeep-P: https://academic.oup.com/bioinformatics/article/27/18/2614/181153

The first paper mentions that different tools are needed because miRNA precursors are different (longer) in plants than in animals:

The most challenging problem in identifying novel plant miRNA is to find a suitable genomic region as a miRNA precursor candidate (to test whether it forms hairpins) because the majority of precursor miRNA in plants are between 100-200 bp [4], which is much longer than those in animals.

Do you know if there are other reasons why plant-specific miRNA tools are needed?

I will look at the seqcluster results.

I will try with trim_reads: true.

I will also try the analysis with Solanum lycopersicum (miRBase code sly) as the known miRNA data set. That species is somewhat close (it is also in the nightshade family), and miRNA sequences are conserved in plants according to one of the above papers. http://www.mirbase.org/cgi-bin/mirna_summary.pl?org=sly

Do I need to do anything to make use of the known sly miRNA sequences?

Do you know how, and by whom, the sequences for a species get added to miRBase?

It would be nice if the microRNA-seq functionality of bcbio worked for plants. I had hoped it would, and did not expect to need plant-specific tools. At the same time I understand that your primary focus is on other species, so this would only make sense to me if it's not too much work or if we could do part of it.

Thanks again for the information.

lpantano commented 6 years ago

Hi,

Yes, those are good examples.

I would be happy to add this to bcbio. It would help if you try some of these tools and tell us the command lines you use. We try to implement tools that we can test with some data and that we know give reasonably good results. If you do that and find one that works, I'll be happy to add it, for sure.

About using this other species with bcbio: that should work. If you have installed the hg19 or mm10 genome, you can look at its files as a model, generate the same files for your species, and put them in the right location.

I am assuming you have the genome set up and can locate the genome_name/build_name/seq folder that bcbio used when you ran the small RNA-seq pipeline.

You need to add the following to the genome-resources.yaml file:

srnaseq:
  srna_transcripts: ../srnaseq/srna-transcripts.gtf
  mirbase_hairpin: ../srnaseq/hairpin.fa
  mirbase_mature: ../srnaseq/mature.fa

In the srnaseq folder, which is at the same level as the seq folder, you need these files:

From miRBase you can download mature.fa, hairpin.fa and miRNA.str (ftp://mirbase.org/pub/mirbase/CURRENT/).

You'll need to prepare your files like this:

zcat hairpin.fa.gz | awk '{if ($0~/>sly/){name=$0; print name} else if ($0~/^>/){name=0};if (name!=0 && $0!~/^>/){print $0;}}' | sed 's/U/T/g' > hairpin.fa

zcat mature.fa.gz | awk '{if ($0~/>sly/){name=$0; print name} else if ($0~/^>/){name=0};if (name!=0 && $0!~/^>/){print $0;}}' | sed 's/U/T/g' > mature.fa

zcat miRNA.str.gz | awk '{if ($0~/sly/)print $0}' > miRNA.str
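For anyone who prefers not to rely on awk, the FASTA filtering above can be sketched in Python. This is a minimal sketch, not part of bcbio: the function names are made up for illustration, and it assumes miRBase headers start with the species code followed by a dash (for example `>sly-`). Unlike the `sed` step above, it converts U to T only in sequence lines and leaves header lines untouched.

```python
import gzip


def filter_species_fasta(lines, prefix="sly"):
    """Keep FASTA records whose header starts with '>prefix-'.

    Sequence lines of kept records are converted from RNA (U) to DNA (T).
    `prefix` is the miRBase species code (assumed header format: '>sly-...').
    """
    keep = False
    out = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # A new record starts: decide whether it belongs to the species.
            keep = line.startswith(">" + prefix + "-")
            if keep:
                out.append(line)
        elif keep and line:
            out.append(line.replace("U", "T"))
    return out


def filter_gzipped_fasta(in_path, out_path, prefix="sly"):
    """Read a gzipped FASTA (e.g. hairpin.fa.gz) and write the filtered records."""
    with gzip.open(in_path, "rt") as handle:
        records = filter_species_fasta(handle, prefix)
    with open(out_path, "w") as out:
        out.write("\n".join(records) + "\n")
```

Calling `filter_gzipped_fasta("hairpin.fa.gz", "hairpin.fa")` and `filter_gzipped_fasta("mature.fa.gz", "mature.fa")` would then stand in for the first two commands; miRNA.str still needs the third command.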

You can use a custom GTF for your species, so if your genome is in Ensembl and there is a gene annotation (ftp://ftp.ensemblgenomes.org/pub/release-39/plants/gtf/solanum_lycopersicum), you can put that one there as the srna-transcripts.gtf. That file is used by seqcluster to annotate the clusters found. I think it should work with the file from the Ensembl database if you are using that genome.

If you get the miRNA part and the clustering working, you can use this package to load all the data:

https://lpantano.github.io/bcbioSmallRna/reference/loadSmallRnaRun.html

In addition, there is this template for a quick QC analysis, which also shows how to get the count data and annotation:

https://github.com/lpantano/bcbioSmallRna/blob/master/inst/rmarkdown/templates/srnaseq/skeleton/skeleton.Rmd

I am currently working on this, so it is a good time to start using it.
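Before rerunning the pipeline, it may be worth confirming that the three files named in the srnaseq section actually exist. The sketch below is only an illustration: the helper name is hypothetical, and it assumes the relative paths in genome-resources.yaml are resolved against the seq folder. The paths are copied into a dictionary by hand so that no YAML library is needed.

```python
import os

# Paths copied from the srnaseq section shown above.
SRNASEQ_RESOURCES = {
    "srna_transcripts": "../srnaseq/srna-transcripts.gtf",
    "mirbase_hairpin": "../srnaseq/hairpin.fa",
    "mirbase_mature": "../srnaseq/mature.fa",
}


def missing_srnaseq_files(seq_dir, resources=SRNASEQ_RESOURCES):
    """Return (key, resolved_path) pairs for resource files that do not exist.

    Assumption: relative paths are resolved against `seq_dir`
    (the genome_name/build_name/seq folder).
    """
    missing = []
    for key, rel_path in sorted(resources.items()):
        path = os.path.normpath(os.path.join(seq_dir, rel_path))
        if not os.path.isfile(path):
            missing.append((key, path))
    return missing
```

An empty list from `missing_srnaseq_files("/path/to/genome_name/build_name/seq")` means all three files were found at the expected locations.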

I hope this helps.

Cheers


WimSpee commented 6 years ago

Hi @lpantano . I tried to run the same analysis with trim_reads : true and species: sly.

This resulted in the following error during adapter removal:

[2018-07-10T16:51Z] execution_machine_x2: 2018-07-10 18:51:49,585 INFO: This is Atropos 1.1.18 with Python 3.6.5
[2018-07-10T16:51Z] execution_machine_x2: 2018-07-10 18:51:49,590 INFO: Trimming 0 adapter with at most 10.0% errors in single-end mode ...
[2018-07-10T16:52Z] execution_machine_x2: =======
[2018-07-10T16:52Z] execution_machine_x2: Atropos
[2018-07-10T16:52Z] execution_machine_x2: =======
[2018-07-10T16:52Z] execution_machine_x2: Atropos version: 1.1.18
[2018-07-10T16:52Z] execution_machine_x2: Python version: 3.6.5
[2018-07-10T16:52Z] execution_machine_x2: Command line parameters: trim --max-reads 500000 -u 22 -se /data/run/Projects/DA-1164/input_fastq/concat/DA_1164_10.fastq.gz -o /data/run/Projects/DA-1164/DA_1164_samples-merged/work/bcbiotx/tmpHUH_oP/DA_1164_10end.fastq.gz
[2018-07-10T16:52Z] execution_machine_x2: Sample ID: DA_1164_10
[2018-07-10T16:52Z] execution_machine_x2: Input format: FASTQ, Read 1, w/ Qualities
[2018-07-10T16:52Z] execution_machine_x2: Input files:
[2018-07-10T16:52Z] execution_machine_x2:   /data/run/Projects/DA-1164/input_fastq/concat/DA_1164_10.fastq.gz
[2018-07-10T16:52Z] execution_machine_x2: Start time: 2018-07-10T18:51:49.589469
[2018-07-10T16:52Z] execution_machine_x2: Wallclock time: 15.54 s (31 us/read; 1.93 M reads/minute)
[2018-07-10T16:52Z] execution_machine_x2: CPU time (main process): 11.20 s
[2018-07-10T16:52Z] execution_machine_x2: --------
[2018-07-10T16:52Z] execution_machine_x2: Trimming
[2018-07-10T16:52Z] execution_machine_x2: --------
[2018-07-10T16:52Z] execution_machine_x2: Reads                                  records   fraction
[2018-07-10T16:52Z] execution_machine_x2: ----------------------------------- ---------- ----------
[2018-07-10T16:52Z] execution_machine_x2: Total reads processed:                 500,000
[2018-07-10T16:52Z] execution_machine_x2: Reads written (passing filters):       500,000     100.0%
[2018-07-10T16:52Z] execution_machine_x2: Base pairs                                  bp   fraction
[2018-07-10T16:52Z] execution_machine_x2: ----------------------------------- ---------- ----------
[2018-07-10T16:52Z] execution_machine_x2: Total bp processed:                 25,500,000
[2018-07-10T16:52Z] execution_machine_x2: Cut unconditionally                 11,000,000      43.1%
[2018-07-10T16:52Z] execution_machine_x2: Total bp written (filtered):        14,500,000      56.9%
[2018-07-10T16:52Z] execution_machine_x2: Unexpected error
Traceback (most recent call last):
  File "/home/my_user/workspace/tmp_bcbio_1.1.0_development/data_dir/anaconda/lib/python2.7/site-packages/bcbio/distributed/ipythontasks.py", line 51, in _setup_logging
    yield config
  File "/home/my_user/workspace/tmp_bcbio_1.1.0_development/data_dir/anaconda/lib/python2.7/site-packages/bcbio/distributed/ipythontasks.py", line 92, in trim_srna_sample
    return ipython.zip_args(apply(srna.trim_srna_sample, *args))
  File "/home/my_user/workspace/tmp_bcbio_1.1.0_development/data_dir/anaconda/lib/python2.7/site-packages/bcbio/srna/sample.py", line 61, in trim_srna_sample
    adapters = adapter if adapter else _dnapi_prediction(in_file, out_dir)
  File "/home/my_user/workspace/tmp_bcbio_1.1.0_development/data_dir/anaconda/lib/python2.7/site-packages/bcbio/srna/sample.py", line 157, in _dnapi_prediction
    max_score = iterative_result[1][1]
IndexError: list index out of range

This seems to be sample specific. Other samples seem not to run into this error.

[2018-07-10T12:54Z] execution_machine_x2: 2018-07-10 14:54:13,644 INFO: This is Atropos 1.1.18 with Python 3.6.5
[2018-07-10T12:54Z] execution_machine_x2: 2018-07-10 14:54:13,655 INFO: Trimming 0 adapter with at most 10.0% errors in single-end mode ...
[2018-07-10T12:54Z] execution_machine_x2: =======
[2018-07-10T12:54Z] execution_machine_x2: Atropos
[2018-07-10T12:54Z] execution_machine_x2: =======
[2018-07-10T12:54Z] execution_machine_x2: Atropos version: 1.1.18
[2018-07-10T12:54Z] execution_machine_x2: Python version: 3.6.5
[2018-07-10T12:54Z] execution_machine_x2: Command line parameters: trim --max-reads 500000 -u 22 -se /data/run/Projects/DA-1164/input_fastq/concat/DA_1164_01.fastq.gz -o /data/run/Projects/DA-1164/DA_1164_samples-merged/work/bcbiotx/tmpEtG7q8/DA_1164_01end.fastq.gz
[2018-07-10T12:54Z] execution_machine_x2: Sample ID: DA_1164_01
[2018-07-10T12:54Z] execution_machine_x2: Input format: FASTQ, Read 1, w/ Qualities
[2018-07-10T12:54Z] execution_machine_x2: Input files:
[2018-07-10T12:54Z] execution_machine_x2:   /data/run/Projects/DA-1164/input_fastq/concat/DA_1164_01.fastq.gz
[2018-07-10T12:54Z] execution_machine_x2: Start time: 2018-07-10T14:54:13.654689
[2018-07-10T12:54Z] execution_machine_x2: Wallclock time: 15.33 s (31 us/read; 1.96 M reads/minute)
[2018-07-10T12:54Z] execution_machine_x2: CPU time (main process): 11.04 s
[2018-07-10T12:54Z] execution_machine_x2: --------
[2018-07-10T12:54Z] execution_machine_x2: Trimming
[2018-07-10T12:54Z] execution_machine_x2: --------
[2018-07-10T12:54Z] execution_machine_x2: Reads                                  records   fraction
[2018-07-10T12:54Z] execution_machine_x2: ----------------------------------- ---------- ----------
[2018-07-10T12:54Z] execution_machine_x2: Total reads processed:                 500,000
[2018-07-10T12:54Z] execution_machine_x2: Reads written (passing filters):       500,000     100.0%
[2018-07-10T12:54Z] execution_machine_x2: Base pairs                                  bp   fraction
[2018-07-10T12:54Z] execution_machine_x2: ----------------------------------- ---------- ----------
[2018-07-10T12:54Z] execution_machine_x2: Total bp processed:                 25,500,000
[2018-07-10T12:54Z] execution_machine_x2: Cut unconditionally                 11,000,000      43.1%
[2018-07-10T12:54Z] execution_machine_x2: Total bp written (filtered):        14,500,000      56.9%
[2018-07-10T12:54Z] execution_machine_x2: Adding adapter to the list: TGGAATTCTCGGG with score 282.8354
[2018-07-10T12:54Z] execution_machine_x2: Adding adapter to the list: GGTGCCAAGGAA with score 78.588
[2018-07-10T12:54Z] execution_machine_x2: remove adapter for DA_1164_01
[2018-07-10T14:02Z] execution_machine_x2: Collapsing /data/run/Projects/DA-1164/DA_1164_samples-merged/work/trimmed/DA_1164_01/DA_1164_01.clean.fastq.gz with --min_size 16 --min 1

For this sample I am also not sure why Atropos is run before _dnapi_prediction, or where Atropos gets the TGGAATTCTCGGG and GGTGCCAAGGAA adapters from.
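The IndexError in the first traceback comes from the expression `iterative_result[1][1]`, which needs at least two entries in the list. The sketch below only illustrates that failure mode and one way to guard against it. It is not bcbio's code: the function name is hypothetical, and the list of (adapter, score) pairs is an assumed shape, inferred from the "Adding adapter to the list ... with score ..." log lines.

```python
def second_candidate_score(candidates):
    """Return the score of the second candidate, or None if there is none.

    `candidates` is a hypothetical list of (adapter_sequence, score) pairs.
    Indexing candidates[1][1] directly raises IndexError when fewer than
    two candidates were predicted.
    """
    if len(candidates) < 2:
        return None
    return candidates[1][1]


# Two candidates, as in the DA_1164_01 log: the second score is available.
two = [("TGGAATTCTCGGG", 282.8354), ("GGTGCCAAGGAA", 78.588)]
print(second_candidate_score(two))

# A single candidate: direct indexing would fail, the guard returns None.
one = [("TGGAATTCTCGGG", 282.8354)]
print(second_candidate_score(one))
```

Whether the failing sample really produced fewer than two candidates cannot be confirmed from the log alone; the sketch only shows why that situation would produce this exact error.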

lpantano commented 6 years ago

Hi,

Sorry about this. It seems that the tool we use to predict the adapter is not working for that sample. If you know the 3' adapter, I suggest adding it to adapters: [] in the config file:

https://github.com/bcbio/bcbio-nextgen/blob/master/config/templates/illumina-srnaseq.yaml#L8

If you don't know it, you can ask the sequencing core.

In this case, I suggest restarting the analysis from scratch.

Let me know if that helps.

roryk commented 5 years ago

Thanks, closing this as it seems like it's been answered.