Can MAZTER-mine code be applied to small RNA-seq data?

Dear Miguel,

It's interesting to analyze your tutorial data with these computational pipelines - bam2ReadEnds.R and mazter_mine.R. I'm trying to analyze small RNA libraries constructed by SMARTer smRNA-Seq Kit for Illumina with the help of your code.

But I'm not sure if my data satisfies the required format of your Input Library. This library of RNA fragments digested by mazF is paired-end data with a read length of 101 bp, and the adapter trimmed read length is about 15 ~ 98 bp.

When I used this trimmed data as input to your bam2ReadEnds.R code, the resulting RData was too small to get enough output for downstream analysis. I guess the reason is that my data doesn't match the exon block created by the exonBlockGen function in bam2ReadEnds.R.

This could be the problem during STAR alignment because there were too many unmapped reads because of the small size of smRNA data. So I added some STAR parameters recommended for smRNA analysis (--outFilterMismatchNoverLmax 0.05 --outFilterMatchNmin 16 --outFilterScoreMinOverLread 0 --outFilterMatchNminOverLread 0), and I got many uniquely mapped reads.

But when I applied these reads to the bam2ReadEnds.R code, the results were similar to the previous output. This didn't match the exon block and got small-sized RData.

Could you help apply this smRNA data to your code?

Sincerely, Seoyeon

Hello Seoyen.

Could you please provide me the files and code to reproduce your analysis. Without it is hard to understand what may be the problem.

I would need mainly from the bam files onward. After alignment

Best

On Tue, Jun 15, 2021 at 4:47 PM Seoyeon Kim @.***> wrote:

Dear Miguel,

I am using MAZTER-mine code to analyze my RNA fragment which was digested by mazF. I'm new to bioinformatics and this is first time making issues on github, so I hope this issue doesn't bother you..

It was really interesting to analyze your tutorial data with these computational pipelines - bam2ReadEnds.R and mazter_mine.R. Now, I'm trying to analyze my data - especially small RNA libraries which was constructed by SMARTer smRNA-Seq Kit for Illumina - with the help of your code.

I'm afraid I'm not sure my data satisfies the required form of your Input Library. My library was processed from other lab, and it is paired-end data, the read length is 101 bp but I performed adapter trimming and the processed read length is about 15 ~ 98 bp - RNA fragments digested by mazF.

When I tried to apply this trimmed data to your bam2ReadEnds.R code, it didn't make R data appropriate for mazter_mine.R because the size of RData output was too small - maybe it's because my data doesn't match for the exon block created by the exonBlockGen function in bam2ReadEnds.R.

I thought this could be the problem during STAR alignment because there were too many unmapped reads because of the small size of smRNA data. So I added some STAR parameters recommended for smRNA analysis (--outFilterMismatchNoverLmax 0.05 --outFilterMatchNmin 16 --outFilterScoreMinOverLread 0 --outFilterMatchNminOverLread 0), and I got many uniquely mapped reads.

But when I applied this reads to the bam2ReadEnds.R code, the results were similar to previous output. This didn't match for the exon block and got really small sized RData.

I think this would be silly question for you, but could you please help me apply this smRNA data for your code?

Thank you very much,

Sincerly, Seoyeon

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/SchwartzLab/mazter_mine/issues/6, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACHMOKJGBXBWXX4MNGVSHQDTS5KY5ANCNFSM46XJ7NBA .

Miguel Angel García Campos

Website https://angelcampos.github.io/ | Twitter https://twitter.com/GarciaCamposM | GScholar https://scholar.google.com/citations?user=-J7nzzcAAAAJ&hl | Instagram https://www.instagram.com/garciacampos_m/

PhD candidate at the SchwartzLab https://www.weizmann.ac.il/molgen/Schwartz/, Molecular Genetics Department, Weizmann Institute of Science.

Hello Miguel,

Thanks for your reply. I uploaded the data via email. This is a small RNA data from human U2OS cell mRNA digested by mazF RNase and the library was constructed by SMARTer smRNA-Seq Kit for Illumina.

I trimmed all the adapter sequences included in the fastq.gz file and due to the small size of reads after trimming, I performed STAR alignment with following parameters. And among these the parameters that I mentioned on the first message was the ones recommended by Alex Dobin for the small sized RNAs.

--outFilterMismatchNoverLmax 0.05 \
--outFilterMatchNmin 16 \
--outFilterScoreMinOverLread 0 \
--outFilterMatchNminOverLread 0 \
-–alignIntronMax 300 \
-–alignMatesGapMax 1000 \
--outSAMunmapped Within \
--quantMode TranscriptomeSAM GeneCounts \
--runThreadN 8 \
--twopassMode Basic

After getting bam file, I ran the bam2ReadEnds.R code with hg38 v32 BED file. The code made some outputs, but the RData output was too small to be used in mazter_mine.R code.

When I run the bam2ReadEnds.R code line by line, I found out that some problems are associated with pairEndReadsToGenes function - because the reads weren't seemd to be matched properly with exon blocks made from gene annotation file. I wonder why the reads couldn't make enough match with the annotation data.

Thanks again for making mazter_mine pipeline and for your help. Please let me know f there's anything else I should do.

Sincerly, Seoyeon

I'm trying to load the file into different programs, namely R, IGV, and samtools, and I'm having trouble.

At the very least it should be sortable and indexable using samtools, but it seems that is not even possible.

This command was not able to sort the file: samtools sort U2OS2-9_output_20210612Aligned.out.bam -o U2OS2-9_output_20210612Aligned.sorted.out.bam @.***

It should be possible to sort it and indexed with the following commands: samtools sort U2OS2-9_output_20210612Aligned.out.bam -o U2OS2-9_output_20210612Aligned.sorted.out.bam @.*** samtools index U2OS2-9_output_20210612Aligned.sorted.out.bam

Can you please check the file and/or create it again and send it back sorted and the index file along?

The BED file that you used for gene annotation would be also be of help

PS: Could you make a double-check that your gene annotation corresponds to the genome you are aligning to. From the problem you mentioned it could be that that is the problem.

On Tue, Jun 15, 2021 at 7:30 PM Seoyeon Kim @.***> wrote:

Hello Miguel,

Thanks for your reply. I uploaded the data on dropbox and here is the link

https://www.dropbox.com/s/87kxkbhxe7r5qdn/U2OS2-9_output_20210612Aligned.out.bam?dl=0

This is a small RNA data from human U2OS cell mRNA digested by mazF RNase and the library was constructed by SMARTer smRNA-Seq Kit for Illumina.

I trimmed all the adapter sequences included in the fastq.gz file and due to the small size of reads after trimming, I performed STAR alignment with following parameters. And among these the parameters that I mentioned on the first message was the ones recommended by Alex Dobin for the small sized RNAs.

--outFilterMismatchNoverLmax 0.05 \

--outFilterMatchNmin 16 \

--outFilterScoreMinOverLread 0 \

--outFilterMatchNminOverLread 0 \

-–alignIntronMax 300 \

-–alignMatesGapMax 1000 \

--outSAMunmapped Within \

--quantMode TranscriptomeSAM GeneCounts \

--runThreadN 8 \

--twopassMode Basic

After getting bam file, I ran the bam2ReadEnds.R code with hg38 v32 BED file. The code made some outputs, but the RData output was too small to be used in mazter_mine.R code.

When I run the bam2ReadEnds.R code line by line, I found out that some problems are associated with pairEndReadsToGenes function - because the reads weren't seemd to be matched properly with exon blocks made from gene annotation file. I wonder why the reads couldn't make enough match with the annotation data.

Thanks again for making mazter_mine pipeline and for your help. Please let me know f there's anything else I should do.

Sincerly, Seoyeon

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/SchwartzLab/mazter_mine/issues/6#issuecomment-861649787, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACHMOKKOCXFUN3HZBVJV2ZTTS552HANCNFSM46XJ7NBA .

Miguel Angel García Campos

PhD candidate at the SchwartzLab https://www.weizmann.ac.il/molgen/Schwartz/, Molecular Genetics Department, Weizmann Institute of Science.

Dear Miguel,

I really appreciate for your help. After receiving your reply, I've been recreating and sorting my bam files, and I found out that there might be some problem in STAR alignment when I used my data as paired-end reads.

All of my reads (Read1 and Read2) are almost exactly overlap, maybe this is because my data is small RNA-seq data. The overlapping mates were causing low mappability, and aligning data with only Read1 indeed resulted in a much better mappability. This appears to be the cause of generating too small and inappropriate RData after running bam2ReadEnds.R code.

But I found out that your mazter_mine pipeline requires paired-end reads, so I would like to ask you if I could apply single end data for your mazter_mine pipeline. I would like to know if I can apply my data as singled-end reads for mazter_mine pipeline.

Thank you for reading my post, and hope you have a good day.

Sincerly, Seoyeon

Good to hear you made some progress trouble shooting your data. Sorry to not come back sooner, I've been busy. I'm reaching the end of my PhD.

The problem with mapping only read1 is that the end of the fragment is not mapped correctly. The whole pipeline is built considering an RNA-seq library that will be paired-end sequenced with strand awareness, this is, all the sequences are in principle in the same sense as the original transcript and not as the reverse complement, which would happen in non-strand-aware RNA-seq. These requirements are needed for the pipeline to work correctly.

About the removal of adapters I didnt use any adapter removal. It may be part from the preprocessing that the RNAseq unit does automatically when we give them the sequences of our adapters. But as STAR can soft-clip having adapter sequences should in principle not be a problem when using STAR. Another way for controling for this is checking if you have rthose sequences with FASTQC and remove them with Trim Galore.

Hope you manage to align your data.

Best

Miguel Angel García Campos PhD candidate at the SchwartzLab, Molecular Genetics Department, Weizmann Institute of Science.

On Tue, Jun 22, 2021, 19:41 Seoyeon Kim @.***> wrote:

Dear Miguel,

I really appreciate for your help. After receiving your reply, I've been recreating and sorting my bam files, and I found out that there might be some problem in STAR alignment when I used my data as paired-end reads.

All of my reads (Read1 and Read2) are almost exactly overlap, maybe this is because my data is small RNA-seq data. The overlapping mates were causing low mappability, and aligning data with only Read1 indeed resulted in a much better mappability. This appears to be the cause of generating too small and inappropriate RData after running bam2ReadEnds.R code.

But I found out that your mazter_mine pipeline requires paired-end reads, so I would like to ask you if I could apply single end data for your mazter_mine pipeline. I would like to know if I can apply my data as singled-end reads for mazter_mine pipeline.

Thank you for reading my post, and hope you have a good day.

Sincerly, Seoyeon

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/SchwartzLab/mazter_mine/issues/6#issuecomment-866151085, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACHMOKMAXZAHZDQTRL5NGUDTUC4MPANCNFSM46XJ7NBA .

SchwartzLab / mazter_mine

Can MAZTER-mine code be applied to small RNA-seq data? #6

Thanks for your reply. I uploaded the data on dropbox and here is the link