bcbio / bcbio-nextgen

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis
https://bcbio-nextgen.readthedocs.io
MIT License

small rna pipeline #2339

Closed ahsen1402 closed 5 years ago

ahsen1402 commented 6 years ago

Hi,

First of all, thanks a lot for your help. This is kind of a continuation of thread #2335. After the update I started running the pipeline, but something still confuses me. Just for clarification, this is the YAML file:

upload:
  dir: bcbio_analysis/upload
details:
  - analysis: smallRNA-seq
    algorithm:
      aligner: star
      species: hsa
    description: data_no_trim
    genome_build: hg38

Since I pre-trimmed the data using cutadapt, I wanted to skip the trimming step, which I thought the above configuration would do. However, this is from the log file:

[2018-03-19T17:27Z] Timing: adapter trimming
[2018-03-19T17:27Z] multiprocessing: trim_srna_sample
[2018-03-19T21:10Z] Timing: prepare
[2018-03-19T21:10Z] multiprocessing: seqcluster_prepare
[2018-03-19T21:15Z] Timing: alignment
[2018-03-19T21:15Z] multiprocessing: srna_alignment
[2018-03-19T21:15Z] Aligning lane 10B-lp1_GAGATTCC-ATAGAGGC_AH3VFYBCX2_L001_001_R1 with star aligner
[2018-03-19T21:17Z] Timing: small RNA annotation

From my understanding it still does some sort of trimming, because it took about 3 hours to do something. Isn't there a way to skip the trimming phase entirely? It would save time and resources.

Thanks in advance

lpantano commented 6 years ago

Sorry about this. I think there is a step that computes some stats about the trimming, but it shouldn't take that long. Let me check, and I will disable it if I find that it is the cause.

Thanks

lpantano commented 6 years ago

Actually, I just checked and read your message more carefully.

Yes, there is a collapsing step after trimming that is mandatory for many of the small RNA tools, like miRNA annotation.

How many samples do you have? Is it only one? And how big is the file you had trimmed?

Cheers

ahsen1402 commented 6 years ago

Thanks a lot. I have 30 samples with a total of 300 million reads, so about 10 million reads per sample. Just FYI, I am running this on a computing cluster with 12 cores and a total of 40 GB RAM, if that makes any difference.

lpantano commented 6 years ago

Hi,

I think the run time makes sense to me now.

Let me know if you find any issue.

Thanks

ahsen1402 commented 6 years ago

Hi, I restarted the analysis; thankfully it did not redo the earlier steps, but after 10 hours, while running seqcluster, it gave the following error:

Traceback (most recent call last):
  File "user/software_data/bcbio/anaconda/bin/seqcluster", line 11, in <module>
    load_entry_point('seqcluster==1.2.4a0', 'console_scripts', 'seqcluster')()
  File "user/software_data/bcbio/anaconda/lib/python2.7/site-packages/seqcluster/command_line.py", line 28, in main
    cluster(kwargs["args"])
  File "user/software_data/bcbio/anaconda/lib/python2.7/site-packages/seqcluster/make_clusters.py", line 85, in cluster
    clusLred = _annotate(args, clusLred)
  File "user/software_data/bcbio/anaconda/lib/python2.7/site-packages/seqcluster/make_clusters.py", line 226, in _annotate
    c = a.intersect(b, wo=True)
  File "user/software_data/bcbio/anaconda/lib/python2.7/site-packages/pybedtools/bedtool.py", line 806, in decorated
    result = method(self, *args, **kwargs)
  File "user/software_data/bcbio/anaconda/lib/python2.7/site-packages/pybedtools/bedtool.py", line 337, in wrapped
    decode_output=decode_output,
  File "user/software_data/bcbio/anaconda/lib/python2.7/site-packages/pybedtools/helpers.py", line 356, in call_bedtools
    raise BEDToolsError(subprocess.list2cmdline(cmds), stderr)
pybedtools.helpers.BEDToolsError: Command was:
bedtools intersect -wo -b user/software_data/bcbio/genomes/Hsapiens/hg38/srnaseq/srna-transcripts.gtf -a /tmp/100709203.tmpdir/pybedtools.9HOMVn.tmp
Error message was:
Error: Invalid record in file user/software_data/bcbio/genomes/Hsapiens/hg38/srnaseq/srna-transcripts.gtf. Record is
chr17_KI270861v1_alt . gene 0 58914 . - . name SLC43A2;
' returned non-zero exit status 1

Now, all of this was downloaded automatically; is there an error in the bcbio data files?

Thanks

lpantano commented 6 years ago

Hi,

Sorry about this.

Yes, it is an error in the data file. GTF coordinates should be 1-based, and that record starts at 0.

I am working on a fix for the data; you will need to update the small RNA data, sorry (I'll ping you in the commit so you know when it is ready).

Another solution is to fix the file yourself with:

sed -i 's/\t0\t/\t1\t/' srna-transcripts.gtf

When you restart, it will jump directly to that step (the annotation of the clusters), so it should be much faster.
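
As a quick sanity check after editing the GTF, a small R sketch along these lines (the path is the one from the error above; adjust it to your installation) can confirm that no records with a start coordinate of 0 remain:

# Minimal sketch: scan the GTF for records whose start column (field 4) is 0.
# The path is taken from the error message above; point it at your own bcbio genome data.
gtf_path <- "user/software_data/bcbio/genomes/Hsapiens/hg38/srnaseq/srna-transcripts.gtf"
fields <- strsplit(readLines(gtf_path), "\t")
zero_start <- vapply(fields, function(x) length(x) >= 4 && x[4] == "0", logical(1))
sum(zero_start)  # should be 0 after the sed fix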

Cheers

ahsen1402 commented 6 years ago

Thanks, it picked up where it left off. Just another question: I want to run every analysis that is possible, so I added expression_caller: [trna,seqcluster,mirdeep2] under algorithm in the YAML file. Is there some other analysis that is missing? I am asking because all of these commands in the markdown file return empty:

files = list.files(file.path(root_path),pattern = "mirbase-ready",recursive = T,full.names = T)
files = list.files(file.path(root_path),pattern = "trimming_stats",recursive = T)
files = list.files(file.path(root_path),pattern = "mirbase-ready",recursive = T,full.names = T)
mirdeep2_files = list.files(file.path(root_path),pattern = "novel-ready",recursive = T,full.names = T)

I was especially surprised that the miRNA calling has not been done, since I specified mirdeep2 as part of the analysis. I would like to know what options I need to give for a complete analysis. Best

lpantano commented 6 years ago

Nice!

Not for now. I am working on adding another tool, miRge 2.0, to quantify miRNA and other RNAs, as an alternative to the current ones or as a complementary analysis. Probably by the end of this month.

Cheers

ahsen1402 commented 6 years ago

Regarding my comment: if all of the analyses are run, why do the commands in the Rmd file return empty?

files = list.files(file.path(root_path),pattern = "mirbase-ready",recursive = T,full.names = T)
files = list.files(file.path(root_path),pattern = "trimming_stats",recursive = T)
files = list.files(file.path(root_path),pattern = "mirbase-ready",recursive = T,full.names = T)
mirdeep2_files = list.files(file.path(root_path),pattern = "novel-ready",recursive = T,full.names = T)

Is there a name incompatibility between the outputs and the report.Rmd file?

Thanks

lpantano commented 6 years ago

Can you list the files that you get for the samples? One sample is enough. The novel-ready file should be there. If not, can you check in the work directory to make sure the mirdeep2 folder exists and is not empty, and that inside the mirdeep2/novel folder you can see files for each sample?

ahsen1402 commented 6 years ago

Hi Lorena,

Just to clarify, under the work folder I have:

align    checkpoints_parallel  mirbase   project-summary.yaml  qc      rna.ps      srna_out_files  trna_mint
bcbiotx  log                   mirdeep2  provenance            report  seqcluster  trimmed

mirdeep2/novel is not empty, but I only see two files, hairpin.fa and miRNA.str; however, under novel there is counts_mirna_novel.tsv.

Under the mirbase folder I have a folder for each sample and two additional files, counts.tsv and counts_mirna.tsv (counts are non-zero). And under each sample folder I have:

10B-lp1_GAGATTCC-ATAGAGGC_AH3VFYBCX2_L001_001_R1.bam
10B-lp1_GAGATTCC-ATAGAGGC_AH3VFYBCX2_L001_001_R1.gff
10B-lp1_GAGATTCC-ATAGAGGC_AH3VFYBCX2_L001_001_R1.mirna
10B-lp1_GAGATTCC-ATAGAGGC_AH3VFYBCX2_L001_001_R1.mirna.back
10B-lp1_GAGATTCC-ATAGAGGC_AH3VFYBCX2_L001_001_R1.mirna.back_summary
10B-lp1_GAGATTCC-ATAGAGGC_AH3VFYBCX2_L001_001_R1_novel.mirna
10B-lp1_GAGATTCC-ATAGAGGC_AH3VFYBCX2_L001_001_R1_novel.mirna.back
10B-lp1_GAGATTCC-ATAGAGGC_AH3VFYBCX2_L001_001_R1_novel.mirna.back_summary

In the report.html that I created from the markdown file I see only seqcluster results, and even the size distributions are not there. To check that: under work/trimmed there is a folder per sample containing a file sample.clean.trimming.fastq_size_stats (nothing like trimming_stats). So what do you think is going wrong?

Thanks

lpantano commented 6 years ago

Thanks for the information. It seems everything is right with the working directory.

Can you show me what you have under the final folder?

A list of the files under one sample folder should be enough.

Thanks

ahsen1402 commented 6 years ago

Under work/trimmed/XXX, I have:

10B-lp1_GAGATTCC-ATAGAGGC_AH3VFYBCX2_L001_001.R1.clean.fastq.gz
10B-lp1_GAGATTCC-ATAGAGGC_AH3VFYBCX2_L001_001.R1.clean.trimming.fastq
10B-lp1_GAGATTCC-ATAGAGGC_AH3VFYBCX2_L001_001.R1.clean.trimming.fastq_size_stats
10B-lp1_GAGATTCC-ATAGAGGC_AH3VFYBCX2_L001_001.R1.fragments.fastq.gz
10B-lp1_GAGATTCC-ATAGAGGC_AH3VFYBCX2_L001_001.R1.short.fastq.gz
10B-lp1_GAGATTCC-ATAGAGGC_AH3VFYBCX2_L001_001_R1.log

lpantano commented 6 years ago

Hi,

That is still the working directory. I meant the folder that you have in your YAML file under the upload option.

Do you have that set up? bcbio is supposed to copy all of the important files to that folder, and those names are the ones the Rmd is trying to match.

Cheers

ahsen1402 commented 6 years ago

10B-lp1_GAGATTCC-ATAGAGGC_AH3VFYBCX2_L001_001_R1-mirbase-ready.counts
10B-lp1_GAGATTCC-ATAGAGGC_AH3VFYBCX2_L001_001_R1-mirbase-ready.gff
10B-lp1_GAGATTCC-ATAGAGGC_AH3VFYBCX2_L001_001_R1-novel-ready.counts
10B-lp1_GAGATTCC-ATAGAGGC_AH3VFYBCX2_L001_001_R1-ready.trimming_stats
10B-lp1_GAGATTCC-ATAGAGGC_AH3VFYBCX2_L001_001_R1-transcriptome.bam
mintmap
qc

Now I see. I guess the issue is that in the automatically generated Rmd file the root path is not the final upload directory. In my case it is defined as analysis/work, whereas it should be analysis/upload?
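
If that is the cause, a minimal sketch like the following, using the upload path from the YAML earlier in this thread (adjust it to your project), re-points root_path at the final directory before re-running the existing list.files() calls:

# Sketch: point root_path at the bcbio upload (final) directory instead of work/.
# The path below comes from the YAML shown earlier in this thread; adjust as needed.
root_path <- "bcbio_analysis/upload"
files <- list.files(file.path(root_path), pattern = "mirbase-ready", recursive = TRUE, full.names = TRUE)
mirdeep2_files <- list.files(file.path(root_path), pattern = "novel-ready", recursive = TRUE, full.names = TRUE)
length(files)  # should now match the number of samples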

ahsen1402 commented 6 years ago

Hi @lpantano, I am now running isomiRs 1.6, and when running

obj <- IsomirDataSeqFromFiles(files = files[rownames(design)], coldata = design[,1:2])

I get the following error: Error in mutate_impl(.data, dots) : Evaluation error: false must be type character, not integer.

Do you know what might be the cause?

Thanks

lpantano commented 6 years ago

Hi,

Sorry about the issue.

Can you paste here the output of files[rownames(design)] and design[,1:2]?

Another way to know if the error is due to the input parameters is to try with only two files and see if the error persists.
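
For example, a minimal two-sample test along these lines (reusing the files and design objects already built in the Rmd in this thread) can narrow down whether the inputs are the problem:

library(isomiRs)

# Sketch: load only the first two samples to see if the error persists.
# `files` and `design` are the objects already created in the report Rmd.
small_files  <- files[rownames(design)][1:2]
small_design <- design[1:2, 1:2]
obj_test <- IsomirDataSeqFromFiles(files = small_files, coldata = small_design)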

Thanks

ahsen1402 commented 6 years ago

Hi Lorena, first, the example in the help section works, but I tried your other suggestion and it did not work. I also tried a dummy design similar to the one in the help section, which does not work either. Here is files[rownames(design)][1:2]:

files[rownames(design)][1:2]
 9B-lp1_ATTACTCG-TATAGCCT_AH3VFYBCX2_L002_001_R1 
 "9B-lp1_ATTACTCG-TATAGCCT_AH3VFYBCX2_L002_001_R1/9B-lp1_ATTACTCG-TATAGCCT_AH3VFYBCX2_L002_001_R1-mirbase-ready.counts" 
19A-3ng-lp1_TCCGCGAA-GGCTCTGA_AH3VFYBCX2_L001_001_R1 
"19A-3ng-lp1_TCCGCGAA-GGCTCTGA_AH3VFYBCX2_L001_001_R1/19A-3ng-lp1_TCCGCGAA-GGCTCTGA_AH3VFYBCX2_L001_001_R1-mirbase-ready.counts" 

and design[1:2,1:2]

                                                                                                         sample_id.1 replicate
9B-lp1_ATTACTCG-TATAGCCT_AH3VFYBCX2_L002_001_R1           9B-lp1         2
19A-3ng-lp1_TCCGCGAA-GGCTCTGA_AH3VFYBCX2_L001_001_R1 19A-3ng-lp1         1

I used only the first two entries to make sure, and the command

obj <- IsomirDataSeqFromFiles(files = files[rownames(design)][1:2], coldata = design[1:2,1:2] , header = T, skip=0)

gave the same error.

Thanks

lpantano commented 6 years ago

Hi,

Thanks for trying.

Any chance you can install the package from the repo:

devtools::install_github("lpantano/isomiRs") and see if the new version works?

If not, any chance you can send me the files so I can debug further?

Cheers

ahsen1402 commented 6 years ago

Hi Lorena,

This time it did actually work, but I got these warnings, which might help you:

Skipping sample /sc/orga/projects/stolog01a/bcbio_analysis/upload1/19A-3ng-lp1_TCCGCGAA-GGCTCTGA_AH3VFYBCX2_L001_001_R1/19A-3ng-lp1_TCCGCGAA-GGCTCTGA_AH3VFYBCX2_L001_001_R1-mirbase-ready.counts. Low number of hits according to minHits.
Skipping sample /sc/orga/projects/stolog01a/bcbio_analysis/upload1/19A-0-3ng-lp2_ATTCAGAA-TATAGCCT_AH3VFYBCX2_L001_001_R1/19A-0-3ng-lp2_ATTCAGAA-TATAGCCT_AH3VFYBCX2_L001_001_R1-mirbase-ready.counts. Low number of hits according to minHits.
This sample hasn't any lines: /sc/orga/projects/stolog01a/bcbio_analysis/upload1/19A-3ng-lp3_AGCGATAG-GGCTCTGA_AH3VFYBCX2_L001_001_R1/19A-3ng-lp3_AGCGATAG-GGCTCTGA_AH3VFYBCX2_L001_001_R1-mirbase-ready.counts
This sample hasn't any lines: /sc/orga/projects/stolog01a/bcbio_analysis/upload1/19A-3ng-lp2_TCTCGCGC-GGCTCTGA_AH3VFYBCX2_L002_001_R1/19A-3ng-lp2_TCTCGCGC-GGCTCTGA_AH3VFYBCX2_L002_001_R1-mirbase-ready.counts
Skipping sample /sc/orga/projects/stolog01a/bcbio_analysis/upload1/19A-0-3ng-lp1_GAGATTCC-TATAGCCT_AH3VFYBCX2_L002_001_R1/19A-0-3ng-lp1_GAGATTCC-TATAGCCT_AH3VFYBCX2_L002_001_R1-mirbase-ready.counts. Low number of hits according to minHits.
Skipping sample /sc/orga/projects/stolog01a/bcbio_analysis/upload1/19A-3ng-lp1_TCCGCGAA-GGCTCTGA_AH3VFYBCX2_L002_001_R1/19A-3ng-lp1_TCCGCGAA-GGCTCTGA_AH3VFYBCX2_L002_001_R1-mirbase-ready.counts. Low number of hits according to minHits.
This sample hasn't any lines: /sc/orga/projects/stolog01a/bcbio_analysis/upload1/19A-3ng-lp2_TCTCGCGC-GGCTCTGA_AH3VFYBCX2_L001_001_R1/19A-3ng-lp2_TCTCGCGC-GGCTCTGA_AH3VFYBCX2_L001_001_R1-mirbase-ready.counts
Total samples filtered due to low number of hits: 7

Thanks a lot
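
If those samples should not be dropped, and assuming your isomiRs version exposes a minHits argument (the threshold the warnings above refer to; check the function help first, since this is an assumption), a relaxed loading call might look like this sketch:

library(isomiRs)

# Sketch under the assumption that this isomiRs version accepts a `minHits`
# argument (the threshold named in the warnings above); lowering it keeps
# samples with few miRNA hits, at the cost of noisier counts.
obj <- IsomirDataSeqFromFiles(files = files[rownames(design)],
                              coldata = design[, 1:2],
                              minHits = 1)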

ahsen1402 commented 6 years ago

Hi,

I am running the bcbio small RNA analysis on a cluster with 12 nodes with 3.5 GB of memory per node. When I look at the log files I see:

[2018-03-27T19:49Z] Resource requests: miraligner, picard; memory: 3.00, 3.00; cores: 16, 16
[2018-03-27T19:49Z] Configuring 1 jobs to run, using 1 cores each with 3.00g of memory reserved for each job

It looks like I cannot run the pipeline in parallel, and as such it is taking too much time. Am I missing something on the parallelization side, or is there a way to make the pipeline take advantage of all 12 nodes?

Thanks

lpantano commented 6 years ago

Hi,

Sorry about that. I can think of a couple of reasons:

1. If this is a restarted run, make sure to remove the files under the checkpoints_parallel folder.
2. Can you look at the batch file sent to the cluster for this step and check its content? That step definitely uses 1 core per sample, and sometimes a single batch file is sent to the cluster with multiple processes inside; it should ask for 1 core for each process. If that file looks like it has 12 processes and asks for 12 cores, then everything would seem fine.

Cheers

ahsen1402 commented 6 years ago

Thanks. It was not a restart. Where do I find this batch file? Another question: the file xx.mirbase-ready.counts has the alignments to the mature miRNAs, which have lengths of ~22 nt. During the analysis do you also align to the hairpin? If so, which file has this information? If not, can we somehow incorporate it?

Thanks

lpantano commented 6 years ago

Hi,

Sorry, the reads mapped to the precursor are removed, because normally there are very few of them and they are not stable sequences when you are doing small RNA-seq and keeping only 20-40 nt sequences from your samples.

I can change the code to keep them; they would appear in the same file, but the DB column would say something other than miRNA.

Would that be enough for you?

Batch files should be in your working folder and contain the resource requests. Normally they have engine as the second word in the file name.

Cheers

ahsen1402 commented 6 years ago

Hi Lorena,

My issue is that when I check the read length distribution, some of my samples have a peak at 22, but for those samples the tool only detects a few reads in the miRBase count file. For my other samples the length distribution is more uniform from 16-100 bp, but even for them, out of 5 million initial reads I have about 10k reads mapped to miRBase. These are samples where we expected some miRNA. So I am trying to understand the data: if not miRNAs, where do these reads come from? Any ideas?

lpantano commented 6 years ago

Hi,

I see. It is weird to have small RNAs longer than 50 nt; can you tell me your library preparation protocol? Was it specific for small RNA data (20-40 nt long)?

Things you can look at:

1. Run the RNA-seq pipeline and see how much you have mapped to genes. This will be shown in multiqc_report.html (as well as how much rRNA, which will be in the project*.yaml file in the final folder for each sample).

2. If you ran seqcluster, then its output should tell you a little about the annotation. I would recommend installing https://github.com/lpantano/bcbioSmallRna/tree/master and loading your data with:

https://github.com/lpantano/bcbioSmallRna/blob/master/R/loadRun.R#L27

And then use the function:

https://github.com/lpantano/bcbioSmallRna/blob/master/R/plots-smallrna.R#L18

I can help if there are issues.

Cheers

ahsen1402 commented 6 years ago

Hi,

I ran my project using hg38 before, and that ran successfully. I have now changed the configuration to hg19. However, I get the following error:

 ['gff', '--sps', 'hsa', '--hairpin', 'software_data/bcbio/genomes/Hsapiens/hg19/srnaseq/hairpin.fa', '--gtf', 'software_data/bcbio/genomes/Hsapiens/hg19/srnaseq/mirbase.gff3', '--format', 'seqbuster', '-o', 'work/bcbiotx/tmpdScM6i', 'work/mirbase/10B-lp1_GAGATTCC-ATAGAGGC_AH3VFYBCX2_L001_001_R1/10B-lp1_GAGATTCC-ATAGAGGC_AH3VFYBCX2_L001_001_R1.mirna']

IOError: [Errno 2] No such file or directory: '/sc/orga/projects/stolog01a/bcbio_analysis/bcbio_upgrade_code/exosome_samples_all_analysis_hg_19/work/bcbiotx/tmpdScM6i/10B-lp1_GAGATTCC-ATAGAGGC_AH3VFYBCX2_L001_001_R1.gff'

Can you guess what the issue is?

Thanks

ahsen1402 commented 6 years ago

Another issue I have is that I need to trim adapters using the options -u 3 -a AAAAAAAAAA. To do that I configured the YAML file as follows:

upload:
  dir: bcbio_analysis/upload4
details:
  - analysis: smallRNA-seq
    algorithm:
      aligner: star
      species: hsa
      adapters: ["AAAAAAAAAA"]
      expression_caller: [trna,seqcluster,mirdeep2]
    description: exosome_all_samples_hg_19_trim
    genome_build: hg19
resources:
  atropos:
    options: ["-u 3 -a AAAAAAAAAA"]

Now when I look into the file bcbio-nextgen-commands.log I see the following commands:

[2018-04-24T02:01Z] bcbio/anaconda/bin/../envs/python3/bin/atropos   -a AAAAAAAAAA --untrimmed-output=bcbio_analysis/bcbio_upgrade_code/exosome_all_samples_hg_19_trim/work/trimmed/10B-lp1_GAGATTCC-ATAGAGGC_AH3VFYBCX2_L001_001_R1/10B-lp1_GAGATTCC-ATAGAGGC_AH3VFYBCX2_L001_001.R1.fragments.fastq.gz -o bcbio_analysis/bcbio_upgrade_code/exosome_all_samples_hg_19_trim/work/bcbiotx/tmpnYiV0M/10B-lp1_GAGATTCC-ATAGAGGC_AH3VFYBCX2_L001_001.R1.clean.fastq.gz -m 17 --overlap=8 -se cell_line_exosome_rna_seq/raw_fastq/Sample_10B-lp1/fastq/10B-lp1_GAGATTCC-ATAGAGGC_AH3VFYBCX2_L001_001.R1.fastq.gz --too-short-output bcbio_analysis/bcbio_upgrade_code/exosome_all_samples_hg_19_trim/work/trimmed/10B-lp1_GAGATTCC-ATAGAGGC_AH3VFYBCX2_L001_001_R1/10B-lp1_GAGATTCC-ATAGAGGC_AH3VFYBCX2_L001_001.R1.short.fastq.gz | tee > bcbio_analysis/bcbio_upgrade_code/exosome_all_samples_hg_19_trim/work/trimmed/10B-lp1_GAGATTCC-ATAGAGGC_AH3VFYBCX2_L001_001_R1/10B-lp1_GAGATTCC-ATAGAGGC_AH3VFYBCX2_L001_001_R1.log
[2018-04-24T02:10Z] bcbio/anaconda/bin/../envs/python3/bin/atropos  -u 3 -a AAAAAAAAAA -se bcbio_analysis/bcbio_upgrade_code/exosome_all_samples_hg_19_trim/work/bcbiotx/tmpnYiV0M/10B-lp1_GAGATTCC-ATAGAGGC_AH3VFYBCX2_L001_001.R1.clean.tmp.fastq.gz -o bcbio_analysis/bcbio_upgrade_code/exosome_all_samples_hg_19_trim/work/bcbiotx/tmpnYiV0M/10B-lp1_GAGATTCC-ATAGAGGC_AH3VFYBCX2_L001_001.R1.clean.fastq.gz -m 17

It seems that atropos is running twice per sample: once with only the adapter, and then again with the options I need. Is there a way to remove this unnecessary step and make atropos run only once? Thanks

lpantano commented 6 years ago

Hi,

Sorry for the issues. Can you update bcbio? That should fix it.

About the second question, you should put only [-u 3] in resources, since the adapter has already been removed at that point.

Atropos runs twice when there are extra options because this was mainly added to be compatible with the 4N protocol, which needs to remove nucleotides after adapter trimming. It needs to run twice because the -u option is applied before the adapter trimming in the set of actions if both are given together.

I'll need to think about how to allow options to be added to the first command, except in the case of the 4N protocol.

Thanks for the idea

ahsen1402 commented 6 years ago

Thanks. I started rerunning it after the update; I hope it goes through smoothly. For the second issue, adding [-u 3] runs atropos again, as you said. Before I upgraded, the solution I found was to specify the adapter as follows: (adapters: ["AAAAAAAAAA -u 3"]), and it seemed to work. However, after updating this does not seem to work. Have you changed anything?

Best

lpantano commented 6 years ago

Hi,

Sorry about that. What exactly is not working?

Can you tell me what it is and which commands are run?

It should work the same. I added an option to accept 4N in the adapter variable to run the extra step automatically, but I didn't remove the ability to add extra parameters.

Just to clarify the steps: atropos runs once with the adapter, and on that output file it runs again with the extra options from the YAML file. Is it not doing this?

Cheers

ahsen1402 commented 6 years ago

Hi Lorena,

Let me clarify: before the upgrade, if I entered (adapters: ["AAAAAAAAAA -u 3"]) in the YAML, there was a single run of atropos with the correct parameters, -a AAAAAAAAAA -u 3. However, after the upgrade, when I do the same I get an error from atropos; it actually does not recognize the adapter. I get the following error:

2018-04-24 16:14:12,204 INFO: This is Atropos 1.1.18 with Python 3.6.4
2018-04-24 16:14:12,210 INFO: Trimming 0 adapter with at most 10.0% errors in single-end mode ...
2018-04-24 16:14:29,532 ERROR: Atropos error
Traceback (most recent call last):
EOFError: gzip process returned non-zero exit code -15. Is the input file truncated or corrupt?
' returned non-zero exit status 1

I guess it no longer recognizes the adapter if you specify it as adapters: ["AAAAAAAAAA -u 3"].

Thanks

lpantano commented 6 years ago

I see,

No, that won't work, and actually it should never have worked.

I am working on an update to make things run in a single pass using the resources option, which is the right way, but it will take me a couple of days.

I know it means an extra run of atropos, but if you put -u 3 into resources it will work; atropos runs twice, but you get the right results anyway. The second run should be quite fast because it is only cutting reads, so it just needs to go through the file and that's all.

Sorry for the troubles.

Cheers

ahsen1402 commented 6 years ago

Thanks very much. No problem; please let me know when you have updated this. Looking forward to it.

ahsen1402 commented 6 years ago

Hi, I see that you added new QC software, qualimap. It runs very slowly (the fastest was 67 minutes, but it can take up to 3-4 hours for some samples). For some reason, if I provide pre-trimmed data it takes 11 minutes to run, whereas if I include atropos trimming inside bcbio it runs really slowly, as mentioned above. Maybe there is some bug in that software, or is this expected? It significantly slows down the analysis, and I am not sure how essential it is. Is there an option to disable it?

Thanks

lpantano commented 6 years ago

Hi,

You can turn it off by using tools_off: [qualimap]; tools_off goes under the algorithm key.

https://bcbio-nextgen.readthedocs.io/en/latest/contents/configuration.html#changing-bcbio-defaults

Let me know if that does the trick.

ahsen1402 commented 6 years ago

Thanks. I am visualizing the results with seqcluster.db; at the same time I have gone over the report.Rmd file and created the clus_ma object there. I thought these two should show the same data, but the total number of reads in seqcluster-viz for a given cluster ID differs from the corresponding entry in the clus_ma object. Am I missing something?

lpantano commented 6 years ago

Hi,

clus_ma contains the raw counts for each cluster. In seqcluster.db you will see the log2-normalized counts over the transcriptome and the counts for each sequence in the cluster, so the two pieces of information are actually never shown in the same place.

seqcluster.db is for exploration or prioritization once you have done the differential expression with the clus_ma matrix, which needs to go through the whole process with DESeq2 or any other tool.
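
As a rough illustration of that last point, a DESeq2 run on clus_ma could look like the sketch below; the coldata data frame and its condition column are assumptions to be built from your own sample metadata, not part of the bcbio output:

library(DESeq2)

# Sketch: differential expression on the raw cluster counts (clus_ma from the Rmd).
# `coldata` is hypothetical: one row per column of clus_ma, with a `condition`
# factor describing the groups to compare, e.g.:
# coldata <- data.frame(condition = factor(c("A", "A", "B", "B")),
#                       row.names = colnames(clus_ma))
dds <- DESeqDataSetFromMatrix(countData = round(clus_ma),
                              colData   = coldata,
                              design    = ~ condition)
dds <- DESeq(dds)
res <- results(dds)
head(res[order(res$padj), ])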

Hope this helps.

ahsen1402 commented 6 years ago

Thank you very much for your explanation and help. BTW, have you integrated miRge 2.0 into the tool yet?

lpantano commented 6 years ago

Hi,

It is integrated, but you need to install it manually for now, until I figure out a way to get the package into bioconda.

As well, you need to download its data manually and point to it in the YAML file, as explained here:

https://bcbio-nextgen.readthedocs.io/en/latest/contents/pipelines.html#smallrna-seq

Cheers

ahsen1402 commented 6 years ago

Thanks. In the seqcluster analysis, do you have an upper limit on the size of the reads that go into the analysis (i.e. are reads longer than some X excluded)? In the annotation file, do you also include the coding genes/mRNAs? And how can I import the seqcluster data into R in a format similar to seqcluster-viz? I.e. for each sequence I need the cluster ID as well as its counts in the different samples.

Thanks

lpantano commented 6 years ago

From the beginning there is a limit of 40 nt. Longer reads go into the *fragments.fastq.gz file in the trimmed folder of the working directory.

There is no way to import exactly the same data into R. If you have seqcluster 1.2.4a (build 4), then you should have a file called counts_sequence.tsv with the information for each sequence in each cluster. You can normalize that with DESeq2 or any other tool, and you should have the annotation of the cluster there as well, or in the other file.
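
A minimal import sketch, assuming counts_sequence.tsv is a tab-separated table with sequence and cluster ID columns followed by one count column per sample (check the header of your own file, since the path and column names here are assumptions):

# Sketch: read the per-sequence counts produced by seqcluster.
# The path and the column names ("seq", "cluster") are assumptions; inspect
# the header of your counts_sequence.tsv and adjust accordingly.
counts_seq <- read.delim("seqcluster/counts_sequence.tsv",
                         header = TRUE, stringsAsFactors = FALSE)
head(colnames(counts_seq))           # identify sequence/cluster vs sample columns
sample_cols <- setdiff(colnames(counts_seq), c("seq", "cluster"))
mat <- as.matrix(counts_seq[, sample_cols])
rownames(mat) <- counts_seq$seq      # one row per sequence, one column per sample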

I hope this helps.

Cheers

ahsen1402 commented 6 years ago

I see. Is there a way to increase this upper limit and run the analysis on all sequences? I have the counts, but I would like to focus more on the individual sequences. Does the seqcluster.json file contain this? It would have been great to have a list with each entry corresponding to a cluster, and a matrix with the cluster sequences as row names. Is this hard to do, do you think?

lpantano commented 6 years ago

Hi,

Right now there is no way to do it, mainly because I don't know how stable seqcluster is with longer sequences. It was designed for miRNA-like sequences; beyond that you are moving more into the RNA-seq analysis approach, where you map to a longer gene.

Yes, seqcluster.json has all the information, but you'll need to parse it. It is a dictionary with each cluster as an element, and inside you have the sequences and the counts for each sample. However, I think it is easier just to do the following:

If you force the installation of seqcluster with bcbio_conda -f update seqcluster, you can remove counts.tsv from the seqcluster folder and restart the run, and it will generate that file with each sequence in each cluster. From that file you can create the matrix as you wish, because all the information is there.
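
If you prefer to go through the JSON instead, a rough jsonlite sketch along these lines can work; the file location and the field names inside each cluster element are assumptions, so inspect one element with str() first:

library(jsonlite)

# Sketch: load the seqcluster JSON and inspect one cluster element.
# The path and the internal field names are assumptions; the only structure
# taken from this thread is a dictionary keyed by cluster, holding sequences
# and per-sample counts.
clusters <- fromJSON("seqcluster/seqcluster.json", simplifyVector = FALSE)
length(clusters)                   # number of clusters
str(clusters[[1]], max.level = 2)  # check the real field names before extracting counts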

Hope this help.

roryk commented 5 years ago

Thanks, closing for now as this was a catch-all issue and there hasn't been much action. Please open up a specific issue with the small RNA pipeline if there are still some problems. Thank you!