flcvlr / MAmBA

analysis pipeline to assess m5C in microRNAs following Bisulfite small RNA sequencing
MIT License
1 stars 0 forks source link

Issue with setting up the bowtie index #1

Closed cahn20 closed 1 month ago

cahn20 commented 1 year ago

Hello,

I'm trying to follow your MAmBA test code using the following command after downloading the files from https://sites.google.com/a/uniroma1.it/valeriofulci-eng/software :

_/mamba.sh -a TGGAATTCTCGGGTGCCAAGG -s hsa -o S1 -f sample1.fq.gz -i /path/to/hg38/bowtie2index/basename -j 8 -n S1 -c GTCGATACCTCACATTAGATC -l log -C red

According to the instructions in the mamba.sh file for flag "-i", the format for the path to the bowtie index is supposed to be something like what is written in the command line code above, but the bowtie index you provide at your website has two different prefixes which are 18S and 28S. How should I be setting the "-i" flag here? I'm getting zero alignment using "-i /path/to/hg38/bowtie2index/18S_C_T_converted", "-i /path/to/hg38/bowtie2index/28S_C_T_converted", "-i /path/to/hg38/bowtie2index/18S", or even just "-i /path/to/hg38/bowtie2index/".

My software versions are as follows: tbb/2018 parallel/20140422 mpfr/4.0.0 libpng/1.2.59 xz/5.2.4 bedtools/2.30.0 bowtie2/2.4.2 curl/7.46.0 gcc/9.3.0 pcre2/10.37 zlib/1.2.11 pigz/2.6.0 samtools/1.9.0 gmp/6.0.0 gdal/2.1.2 perl/5.20.1 jags/4.3.0 mamba/1.0.0 tophat/2.0.13 mpc/1.0.2 geos/3.5.1 proj.4/4.9.1 R/4.1.1

Help would be greatly appreciated. Thanks.

flcvlr commented 1 year ago

Thanks for contacting me.

the "/path/to/hg38/bowtie2index/basename" should be replaced by a path pointing to the hg38 bowtie2 index that you should download on your PC. You should download and unzip the file and then as an argument to the "-i" option of MAmBA provide a full path to it. The path should include the folder AND the basename. If your hg38 index (consisting of 6 files named hg38.1.bt2, hg38.2.bt2, hg38.3.bt2, hg38.4.bt2, hg38.rev.1.bt2, hg38.rev.2.bt2) is in the "/home/Chris/Annotation/human_genome/" folder, than you should provide this path: "/home/Chris/Annotation/human_genome/hg38" after the -i option of MAmBA, where hg38 is the basename (i.e. the part before the dot) of your index.

the 18S and 28S indexes provided in the MAmBA reference are NOT the same thing. Those are internally handled by MAmBA that will find them (they are in the hsa/ folder that you provide to the script as an argument to the -s option) and you do not need to provide them as an argument to MAmBA. The whole human genome index instead is not part of the MAmBA reference, and you should download it and, using the -i option, let MAmBA know where to search for it.

Valerio

cahn20 commented 1 year ago

Thanks for the quick reply.

Which python version should I be using? The requirements for MAmBA are cutadapt v2.0 or later and tophat2; cutadapt v2.0 or later requires python3 and tophat2 doesn't work on python3.

I don't want to perform trimming as I already have trimmed fq files, but leaving the -a flag blank doesn't seem to skip the cutadapt step. If the trimming step could be skipped, I could just load python2 for tophat2.

Instead, cutadapt throws back an error along with an empty trim.fq file which causes errors downstream. Thanks.

flcvlr commented 1 year ago

To make a long story short, python2 and pithon3 are two different programming languages, not two alternative versions of the same software. You can safely have installed on the same machine python2 (to run software written in python2) AND python3 to run software written in python3. I am not really into the details of the environment into which you are running MAmBA, but loading both python2 and python3 in the same environment should be possible. Indeed, most linux machines have both python2 and python3 installed. On all the PCs we used for testing MAmBA we have both python2 and python3 installed and MAmBA works nicely.

That said, we all agree that any new software is expected to be developed in python3 (not python2) and python2 is there to support legacy software (such as Tophat2). In a future release of MAmBA we will replace Tophat2 with an alternative and more recent splice aware aligner. However this would require a major redesign of MAmBA.

Concerning skipping the trimming step, you are right this is not possible. If you do not provide any ADAPTER seq (or an ADAPTER seq shorter than 10 nt) MAmBA will still trim terminal Ns in the fastq file (using cutadapt). This is because in many cases NGS service provider "mask" ADAPTERS rather than trimming them, and therefore the final user has no ADAPTER seq to trim, but still a lot of terminal Ns to trim.

In case you cannot run MAmBA in an environment in which both python2 and python3 are available, just let me know and next week I will edit MAmBA to include an option to simply skip trimming.

Helping you to set up an environment in which both python2 and python3 are available is a more general issue not specifically linked to MAmBA, but in case you have more questions about that and you can provide me details about your configuration I can try to give you some advice. However, the best thing, in case you do not know how to fix it, rather than relying on my help (I am not the Admin of your PC), would be asking the admin of your system and/or IT people at your institution.

Valerio

cahn20 commented 1 year ago

Hi Valerio,

Thanks for the very detailed reply. There was some trouble resolving the issues on the cluster, so I just ended up removing the cutadapt section from the shell script and it seems to run without any errors.

Do you have a page available that explains what each of the output files mean? I'm just wondering if there is additional information such as some additional information in the 12th+ optional fields of the bam files regarding sequence/read conversion.

flcvlr commented 1 year ago

Dear Chris,

I am glad you succeeded running MAmBA. No, MAmBA does not add any additional tags in the bam file, the main output is represented by the content of the "images" folder and the text report files in the output directory you passed to the script

Images are relevant to check that the coverage profile reported is consistent with the mature miRNA annotation. A coverage profile not matching the annotated mature miRNA would suggest that you are looking at the pre-miRNA or pri-miRNA (or even worse, random fragments....). Also, the absolute number of reads mapping to the locus (actually, the highest coverage attained on the pre-miRNA) is reported, so that you can have an idea of how robust is the estimate of non-conversion (30% non converted reads with a coverage of 10 and 30% coverage with 10k reads are actually different). The Y axis in the non-conversion plot is linear or log depending on the corresponding option you passed to MAmBA. Occasionally, you may have very high non-conversion rates at low coverage positions (usually outside the annotated mature miRNA). Those are just random sequencing errors, interpreted by the software as non-conversion events. However they should not attain statistical significance.

Most of the text output files are in BED-like format and the most comprehensive is "hairpin_subbed_sig.bed", it is a tab delimited format in which you have

1) pre-miRNA name (according to mirBase) 2) start coordinates (within pre-mRNA) 3) end (always start + 1, we are looking at 1 nt intervals corresponding to C residues according to reference hg38) 4) mature miRNA name if that C residue is within a mature miRNA, according to mirBase, otherwise pre-miRNA name (i.e. the same as 1) 5) P-value computed by bisRNA 6) strand (of the genome, i.e. the genome strand on which the pre-miRNA lies) 7) non-conversion rate (i.e (observed non-converted C)/reads, the same as field8/field9 ) 8) non-converted C 9) number of reads overlapping that C residue (coverage, including both converted and non-converted) 10) if the C (according to mirBase annotation) lies within a mature miRNA "*", otherwise "."

in the terminal, by using awk (or similar utils) you can easily extract the lines you are most interested in:

some examples:

awk '{if (($9 > 50) && ($7 > 0.1)) print}' hairpin_subbed_sig.bed > coverage_50_non-conv_0.1 awk '{if (($9 > 50) && ($7 > 0.1) && ($10 == "") ) print}' hairpin_subbed_sig.bed > mature_miRNA_coverage_50_non-conv_0.1 awk '{if (($9 > 100) && ($5 < 0.05) && ($10 == "") ) print}' hairpin_subbed_sig.bed > mature_miRNA_coverage_100_P_val_0.05

Otherwise, the file can be loaded into R for further filtering and/or analyses

Please note that P-values are already adjusted for multiple testing (check BisRNA documentation for more details).

Furthermore in the logs directory, you will find a summary of the alignments and conversion rates estimated on 18S and 28S rRNAs (however i recommend using the same tRNA fragment reported in our paper, if you are using human samples). You will also find a summary of the reads processed by bowtie2 (after the line "rRNA/tRNA alignment (bowtie2)"), and the summary of TopHat2. beware that % computed by TopHat2 refer to the reads which did not align to rRNA/tRNAs, not the initial input. Finally, the number of reads aligned to miRNA hairpins (most of them actually will map on mature miRNAs) is reported. We should bear in mind that, due to C-->U conversion, for many mature miRNAs unambiguous alignment to a specific hairpin is impossible, the ones reported here are the unambiguous alignements that were effectively used by MAmBA (this is discussed in the paper also). Based on our experience, you should expect that this latter value (i.e. the number of reads mapping to miRNA hairpins) should equal ~20% of the total reads. However, in our experience this can vary a lot depending on the starting tissue/cell type, we obtained values ranging from 5~40%. Still, the adjusted P-value at each position will take into account the actuallly observed coverage, so, a low % of reads mapping to mature miRNAs will imply a lower resolution, as lowly covered miRNAs will not attain a significant coverage, but highly expressed miRNAs will yield a reliable result anyway.

If you used a 5'-phosphorylated control spike-in (or provided as spike-in sequence an endogenous 5' phosphorylated small RNA expressed in your samples and which you know is no m5C modified) the output of the conversion analysis on that small RNA will be in the "spike_in_analysis.log" file in the "logs" directory of each sample.

Valerio

PS when asking a new question, I would suggest opening a new thread with a new object, so other user will more easily find answers already posted. However, your questions highlighted that the current documentation of MAmBA is incomplete, I will work in the next days to add a full manual.

Il giorno mar 25 ott 2022 alle ore 18:02 Chris @.***> ha scritto:

Hi Valerio,

Thanks for the very detailed reply. There was some trouble resolving the issues on the cluster, so I just ended up removing the cutadapt section from the shell script and it seems to run without any errors.

Do you have a page available that explains what each of the output files mean? I'm just wondering if there is additional information such as some additional information in the 12th+ optional fields of the bam files regarding sequence/read conversion.

— Reply to this email directly, view it on GitHub https://github.com/flcvlr/MAmBA/issues/1#issuecomment-1290800965, or unsubscribe https://github.com/notifications/unsubscribe-auth/AMA5XKUO4S7TDBW3MOIGFSTWFAACRANCNFSM6AAAAAARKOUCGM . You are receiving this because you commented.Message ID: @.***>