dieterich-lab / rp-bp

Rp-Bp is a Bayesian approach to predict, at base-pair resolution, ribosome occupancy and translation.
MIT License

strong bias of found ORF classes between strands #54

Closed tmargus closed 7 years ago

tmargus commented 7 years ago

Hi Brandon,

I found something unexpected when I plotted metagene profiles using the included IPython notebook "create-orf-type-metagene-profiles.ipynb" with my own yeast data.

Most ORFs are tagged as "canonical", "canonical_truncated", or "within". On the forward strand, the majority of ORFs are "canonical" (2,563), with "truncated"/"within" (121/114) forming a smaller group. Surprisingly, on the reverse strand there are far fewer "canonical" ORFs (239), while "truncated" and "within" form the major classes (1,851 and 836, respectively).

[orf-metaplot image attached]

On the reverse strand there is no ribosome pileup at the ends, while on the forward strand it is clearly visible. Could it be that the coordinates become slightly shifted on the reverse strand?
What kind of test would be best for figuring out what, if anything, went wrong?

Cheers, Tonu

bmmalone commented 7 years ago

Hi Tonu,

Given the counts of ORF types, there could be several things going on.

I'm not sure what sequencing protocol you are using, but it could be that the relevant adapter sequences are not present in the file given to flexbar. I believe the file included with the distribution contains standard TruSeq and ArtSeq adapters, but if you are using something else, that could be a problem.

Something else (which you may have already checked) is whether the numbers of uniquely-mapping aligned reads in the <riboseq_data>/without-rrna-mapping/<sample-name>[.<note>]-unique.bam files are roughly the same.

Likewise, while it isn't amazingly helpful for riboseq, you can also try FastQC and just see if anything comes up. I would run it on the fastq file after removing ribosomal RNA: <riboseq_data>/without-rrna/<sample-name>[.<note>].fastq.gz. Of course, the read length distribution throws up flags, but most of the other results should look relatively normal.

It could also be that the semantics of start/end are different on the reverse strand. To check that, you can load the annotated bed file (the <genome_base_path>/<genome_name>.annotated.bed.gz file) into something like IGV and make sure the "thick" parts match up with starts and stops in the genome sequences. In particular, the start codon should be included in the "thick" part and the stop codon should not be included. I attached example images from IGV to show what I mean.
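If you prefer a scripted check to eyeballing IGV, here is a minimal sketch of the same idea (assuming single-exon transcripts, an indexed genome fasta, and placeholder paths; it ignores splicing):

import gzip
import pysam

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(seq):
    return seq.upper().translate(COMPLEMENT)[::-1]

genome = pysam.FastaFile("/path/to/genome.fa")  # placeholder path

with gzip.open("/path/to/annotated.bed.gz", "rt") as bed:  # placeholder path
    for line in bed:
        fields = line.rstrip("\n").split("\t")
        chrom, name, strand = fields[0], fields[3], fields[5]
        thick_start, thick_end = int(fields[6]), int(fields[7])
        if strand == "+":
            start_codon = genome.fetch(chrom, thick_start, thick_start + 3)
            stop_codon = genome.fetch(chrom, thick_end, thick_end + 3)
        else:
            # on the reverse strand, the start codon sits at the thick end
            start_codon = revcomp(genome.fetch(chrom, thick_end - 3, thick_end))
            stop_codon = revcomp(genome.fetch(chrom, thick_start - 3, thick_start))
        # expect start_codon == "ATG" and stop_codon in {"TAA", "TAG", "TGA"}
        print(name, strand, start_codon, stop_codon)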

When this "off by one" kind of problem has turned up before, we usually ended up with many "within" and "suspect" ORFs predicted as translated, but not so many "canonical_truncated" ones. The "truncated" ORFs are in-frame, but there is better signal starting from a downstream start than from the canonical one. Here, you could open the uniquely-mapping read bam file in IGV and just take a look at the signal. Unfortunately, we do not currently create a bam file which removes the non-periodic reads and adjusts for the P-site offsets (we do that directly when creating the ORF profiles), so there is not an easy way to view that. However, there could still be something here.

Depending on how that looks, you could also look into the "unfiltered" predictions bed file. In ipython, you can look at the count of the different ORF types here with something like:

import misc.bio_utils.bed_utils as bed_utils

# read the unfiltered predictions into a pandas data frame
unfiltered_predictions = bed_utils.read_bed("/path/to/my/unfiltered-predictions.bed.gz")

# count the ORFs for each (orf_type, strand) combination
orf_type_groups = unfiltered_predictions.groupby(['orf_type', 'strand'])
orf_type_groups.size()

(unfiltered_predictions is just a pandas data frame here). The (so far undocumented) create-orf-types-pie-chart script can also be used to visualize these counts.

These are the predictions before the filtering described in the "Final prediction set" paragraph (the last one in the methods part) in the paper. If there are many more canonical ORFs on the negative strand here than in the final set, then it means the periodic signal is stronger in the truncated (or "within") ORFs than in the canonical ones, so those are selected instead.


There are other things to look at, as well, but I think this covers most of the easy and not-so-difficult things.

We have run data from a number of different species using several different protocols, and we haven't encountered any protocols with a noticeable strand bias; if that were the case, though, it could still be handled within the current framework. It would require some code to handle alignments from the different strands separately. In essence, though, alignments from different strands could be treated as replicates (with independent periodic read lengths, P-site offsets, etc.).

Please let me know how this goes, or if there is any trouble checking anything. Hopefully something here gives some clues, though.

Have a good day, Brandon

reverse strand end:

[reverse-strand-end image attached]

reverse strand start:

[reverse-strand-start image attached]

tmargus commented 7 years ago

Hi Brandon,

... the relevant adapter sequences are not present in the file given to flexbar. ..

Yes, it is. We use the Ingolia (2012) protocol, and this adapter is included.

... is whether the numbers of uniquely-mapping aligned reads in the <riboseq_data>/without-rrna-mapping/<sample-name>[.<note>]-unique.bam files are roughly the same.

Yes, the number of reads is almost identical - ~37 million on both strands.

... It could also be that the semantics of start/end are different on the reverse strand ..

When I checked a BAM file, bed file and annotation with IGV:

  1. I see that on the reverse strand the first AUG is systematically missed and the ORF is classified as canonical_truncated instead of canonical.

[image attached]

  2. ORFs that are labelled as canonical have only a few reads mapped to them.

[image attached]

Is it related to the low default value of min_signal: 5? But that should not yield a whole ORF anyway.

Cheers, Tonu

bmmalone commented 7 years ago

Hi Tonu,

in reverse strand first AUG is systematically missed

Hmm... that is interesting; we've not come across any sort of bias like this in any of the datasets we've analyzed.

For the first transcript (YAL...1C), it definitely looks like the full canonical transcript should be predicted as translated. It seems that the canonical ORF is not even showing up in the unfiltered predictions. This means either that the signal at the beginning is really not periodic or that the ORF is not extracted correctly.

Can you find the canonical ORF in the Bayes factor file (<riboseq_data>/orf-predictions/<sample_name>[.<note>]-unique.length-<lengths>.offset-<offsets>.smooth.frac-<fraction>.rw-<reweighting_iterations>.bayes-factors.bed.gz)? You can find it with something like:

zcat <bf file> | grep "<transcript name>.*canonical<ctrl + V, tab>"

There, you can see the posterior estimates of the Bayes factor (and a bunch of other stuff). If the BF is below 5, then it means the reads at the beginning are not periodic. If the canonical ORF is not in that file, then there is a problem with extracting it.
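If grep is awkward, a rough pandas equivalent using the read_bed helper from above (the id, bayes_factor_mean, and x_*_sum column names are assumptions based on this thread, not a documented schema):

import misc.bio_utils.bed_utils as bed_utils

bf = bed_utils.read_bed("/path/to/bayes-factors.bed.gz")

# pull the canonical ORF(s) for one transcript; the id is assumed to
# contain both the transcript name and the ORF type
mask = (bf['id'].str.contains("<transcript name>")
        & bf['id'].str.contains("canonical"))
print(bf.loc[mask, ['id', 'bayes_factor_mean', 'x_1_sum', 'x_2_sum', 'x_3_sum']])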

Is it related to low default value of min_signal: 5?

It could be. This gave us good results for our data, but you could increase it if you have deeper coverage.

Please let me know about the canonical ORF; either way it would be good to know what's going on. Thanks.

Have a good day, Brandon

tmargus commented 7 years ago

Hi Brandon,

The full-length canonical orf_type is in the bed file, so ORF extraction works fine, but the Bayes factor is estimated as -inf for the first 252 positions of the gene YAL021C.

Are the columns x_1_sum .. x_3_sum indicating the number of reads assigned to the I/II/III positions of the codon in the P-site? If so, then there is a periodicity shift from the third position to the first position of the codon.

[image attached]

The orange line indicates the ORF that finally appears in the filtered bed file.

Tonu

PS. The same pattern is true for ~20 manually checked genes.

bmmalone commented 7 years ago

Hi Tonu,

Are columns x_1_sum .. x_3_sum indicating the number of reads assigned to I/II/III position of codon in the P-site?

Yes, that is correct. Because there are more reads in x_3 than x_1, we do not calculate the Bayes factor for those ORFs (so it is just set to -inf). In the paper, we describe this in the "Filtering unlikely ORFs" section.
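In toy form, the filter amounts to something like this (a sketch, not the actual rpbp code):

import numpy as np

def screen_orf(x_1_sum, x_2_sum, x_3_sum):
    # "unlikely" ORF: more reads in an out-of-frame position than in-frame,
    # so the Bayes factor calculation is skipped and reported as -inf
    if x_2_sum > x_1_sum or x_3_sum > x_1_sum:
        return -np.inf
    return None  # would proceed to the MCMC-based Bayes factor estimate

print(screen_orf(10, 3, 25))  # -inf, as for YAL021C here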

a periodicity shift from third position to first position of the codon.

That does seem like what's going on. In that sense, Rp-Bp is behaving as expected. Our models assume "translation" means periodicity exclusively in x_1 and not much in the other frames. So if there are a lot of reads in other frames, our assumptions kick in, and we do not predict the ORF as translated.

That said, it is really suspicious that this out-of-frame bias occurs only near start codons on the reverse strand. I am also a bit unsure why the in-frame count (x_1_sum) increases for the truncated versions.

One thing you could try is to look at the read length distribution of the ORF profiles. This updated doc describes how to create these. The orf_num from there (described in the documents) also matches the orf_num in the various prediction files. The main thing here is to try to see if there is a read length-specific bias that is causing this.

I seem to recall that in a few of the riboseq protocols, there is an extra base that needs to be trimmed, in addition to the adapter sequences. I'm not sure if you're already doing this; if not, though, you can try setting that flexbar option (pre_trim_left) in the config file.
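For reference, it would look something like the following config fragment (the flexbar_options formatting here is an assumption; please check the current docs for the exact spelling):

# fragment of the project yaml config (hypothetical formatting)
flexbar_options:
    - "--pre-trim-left 1"  # trim one extra base from the 5' end of each read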


To sum up, the frame counts do explain why the canonical ORFs are not being predicted as translated. However, that doesn't explain why this bias is present there at all. Hopefully, the read length-specific profiles give some clues, and/or maybe additional trimming corrects for the bias. Please let me know how things go. Thanks.

Have a good day, Brandon

bmmalone commented 7 years ago

Hi Tonu,

I just wanted to double-check if you had come across anything that may explain the strand-specific bias. If there is anything in particular that would be helpful to track down what's going on, please let me know.

I'm very slowly documenting the analysis scripts, so maybe there is something already that I could add. If not, it may be possible to quickly add some new sort of analysis script.

Have a good day, Brandon

tmargus commented 7 years ago

Hi Brandon,

Yes, I had some ideas to test. First, I tested whether it is specific to our dataset. I used a set of Green's 2015 data, SRR2046309 (12 million effective reads), and got similar results. Reverse strand: 145, 1214, and 509 for canonical, canonical_truncated, and within, respectively. So it's probably not dataset-specific.

Then I made a similar analysis using human Chr1-5 and found that the main difference is in the numbers of "canonical" and "within" ORFs. The numbers for human Chr1-5 are: forward 937/381/14 and reverse 155/434/266 for canonical/canonical_truncated/within. I did not look at other ORF types in human, but there should be more in the 5'/3' UTRs too. The bias is here, but not as strong as in yeast.

Finally, I thought to test whether the program treats the strands differently, i.e., I want to exclude that possibility. One way to do that is to revert the genome and annotation: make the genome reverse-complemented and do the same with the annotation so it matches the sequence. If this bias is caused by the genes, it should swap strands; if it is related to how the program treats the strands, then the bias should stay on the reverse strand. I used UGene to revert the genome and annotation. It can do that automatically, but the problem is that it only handles GenBank format properly. I had a hard time getting a GTF from the gb file; I got GFF3 format using the BCBio GFF library. How do I specify annotation input from gff in the *.yaml file?

# the full path to the GTF file which contains the exon and CDS annotations
gff: /Users/tmargus/projects/rpbp/green_yeast_2015/0-yeast_reference/ScerR64-1-1.85.gff

or

gtf: /Users/tmargus/projects/rpbp/green_yeast_2015/0-yeast_reference/ScerR64-1-1.85.gff

didn't work

Or is there some simple way to convert gff3 to gtf? I tried Genometools (gt gff3_to_gtf), but it complains about the gff3; it seems the gff3 standard is stricter than what the BCBio GFF library can extract from the gb format.

Cheers, Tonu

bmmalone commented 7 years ago

Hi Tonu,

Answering a bit in reverse...

is there some simple way to convert gff3 to gtf?

I am not aware of a better way to do this; however, for ensembl, you can directly download the gtf files, here for yeast, for example.

I also had a look at the gff3 annotations, and there are some differences which will affect parsing, etc.; in particular, the attribute names for exons and CDSs do not match those in the gtf. I initially thought gff3 would also work, but I'm not so sure now, especially for ORFs on multi-exon transcripts, since we specifically use the transcript_id field to match exons.

However, the example we looked at earlier (YAL021C) is not a multi-exon transcript, so it is not clear that gff vs. gtf would make a difference there.

Nevertheless, could you please give it a shot using the gtf file and let me know how that works. I'm not expecting it to "fix" things, but there could be some interaction I'm not thinking of.

How to specify annotation input from gff in *.yaml file?

You would still need to use the gtf key, but I am pretty sure that gff will work.

program treats strands differently

When creating the profiles, the ORFs on different strands are handled somewhat differently. Basically, the reverse strand ORF profiles are created in the 5'-to-3' orientation of the genome, then flipped at the end so they are in the orientation of the ORF. However, that doesn't really explain what is going on here, where there seems to be a frameshift or some such.
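As a toy numpy illustration of the flipping (illustrative only, not the rpbp internals):

import numpy as np

# counts along the genome, built in genomic 5'-to-3' order
genomic_profile = np.array([5, 0, 0, 3, 0, 0, 8, 0, 0])

def orient_profile(profile, strand):
    # for reverse strand ORFs, flip so index 0 is the ORF's own 5' end
    return profile if strand == '+' else profile[::-1]

print(orient_profile(genomic_profile, '-'))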

Bias is here (in human) but not so strong as in yeast.

Hmm... this is strange. We've also run a large number of human samples, and we never find any bias. For example, for the HEK-gao dataset we use in the paper, the breakdown is 4804/574/62 on the forward strand and 4654/577/70 on the reverse. (This might be slightly different from what we report in the paper due to updates to the pipeline.)

Which annotations were you using here?

dataset specific

I will also see if I can reproduce what is going on with the public datasets. Just to double-check, are you using these samples, with this genome sequence? I will directly use the gtf file linked above.

Have a good day, Brandon

bmmalone commented 7 years ago

Hi Tonu,

Could you please also let me know how you are creating the ribosomal index (i.e., which sequences)? I just want to have a setup that is as close as possible to what you have.

Have a good day, Brandon

tmargus commented 7 years ago

Hi Brandon,

dataset specific

Yes, I used the Saccharomyces_cerevisiae.R64-1-1.85 genome, ncRNA, and annotation from Ensembl. The non-coding RNA is also from the same version.

From the published data (Green's lab Cell paper, 2015), SRR2046310 is better to use than SRR2046309. Both are WT controls, but the first has less non-coding contamination (12%) than the latter (60%). The links above are correct.

For human, I used the first five chromosomes (pooled together) of GRCh38 from Ensembl.

dna: Chr1, etc.

annotation in GTF: Homo_sapiens.GRCh38.87.chr.gtf. I then constructed the GTF for the first 5 chromosomes by running the following with $1==1 (and likewise for 2, 3, 4, and 5):

cat Homo_sapiens.GRCh38.87.chr.gtf | awk '$1==1 {print}' >> grch38_chr_1-5.gtf
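An equivalent single pass would be something like (a sketch, untested):

awk '$1 ~ /^[1-5]$/' Homo_sapiens.GRCh38.87.chr.gtf > grch38_chr_1-5.gtf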

non-coding RNA: also from Ensembl

Cheers, Tonu

tmargus commented 7 years ago

For example, for the HEK-gao dataset we use in the paper, the breakdown is 4804/574/62 and the forward strand and 4654/577/70 on the reverse.

I'm trying to run Gao's data SRR1630831 on my mac against the human genome (GRCh38 - chromosomes only). Let's see whether I get similar results.

tmargus commented 7 years ago

Hi Brandon,

About the GFF3 to GTF conversion question: generally it is not a problem. As you mentioned previously, you can get the gtf from the Ensembl repository, and these files work fine with rpbp. That is what I am using.

My question arose because of the unusual task of reverting a standard genome's strands together with its annotation - people normally don't do that. The solutions I found were not able to write proper GTF output. It would probably take less time to write a small script than to keep searching and testing.

I think I don't need that anymore. Hopefully, testing the same datasets on different platforms will guide us toward the reasons for the observed bias.

Cheers, Tonu

bmmalone commented 7 years ago

Hi Tonu,

GFF3 to GTF conversion question

Ah, okay, I see what you mean now. I don't really know a good way to do the flipping and such.

testing same datasets on different platforms will guide us close to reasons of found bias

I hope so. I'm going to try the yeast data today, and I'll let you know what I find.

Have a good day, Brandon

bmmalone commented 7 years ago

Hi Tonu,

So... good/bad news. I downloaded the yeast dataset and ran it on our cluster. I observe basically the same thing that you do: a "regular" distribution of ORF types on the forward strand, but very few canonical and many canonical_truncated (and sometimes many within; I ran all of the riboseq samples from that study).

Briefly looking at the counts of reads in each frame, it again appears like there is some sort of frame shift or something. The annotations and things in IGV also look okay (e.g., start codon matches with the "thick" part, etc.).

This is the same code that we use for all of the other samples we have processed, so I don't really think it's a bug related to handling the strands.

Also, the "out of frame" stuff seems to only be happening at the start of the reverse strand ORFs, not later on (thus, all of the truncated versions), so that also suggests there may be something else going on.

Do you know if there is anything about the protocol that may result in the reverse strand behaving differently than the forward strand?

I will try to dig a bit more and see if anything comes up. Please also let me know how things go with the HEK dataset. Thanks.

Have a good day, Brandon

bmmalone commented 7 years ago

Hi Tonu,

There are a few things that can be checked; I'm going to make a list here to not forget. Also, please feel free to add to it.

Have a good day, Brandon

ydr341c_iv 1151805-1153071 -

tmargus commented 7 years ago

Hi Brandon,

So... good/bad news. I downloaded the yeast dataset and ran it on our cluster. I observe basically the same thing that you do:

Then we can rule out platform-specific differences. That is good.

Briefly looking at the counts of reads in each frame, it again appears like there is some sort of frame shift or something.

The same here; the pattern we see is the same. It can't be a real frameshift of translating ribosomes. It is possible that there might be a shift in the 5' ends of some reads in some regions with a biological meaning, but seeing it predominantly on one strand doesn't make sense.

I would try 3' assignment and see whether it gives the same results. Is there any undocumented command-line option or line in the *.yaml file to switch to 3' assignment? Or does one have to modify the files extract_metagene_profiles.py and ribo_utils.get_p_sites, as referred to in #42?

Do you know if there is anything about the protocol that may result in the reverse strand behaving differently that the forward strand?

We are using Ingolia's protocol (Nature Protocols, 2012), and Green's lab uses the same. It is hard to imagine some strand specificity slipping into the libraries. First, strands are arbitrary for different chromosomes, and in riboseq you extract ribosome-protected mRNA fragments. I can't imagine something flipping in from there.

...Please also let me know how things go with the HEK dataset.

STAR indexing needs a bit more RAM to coexist with the OS than I have in my desktop [32G]. So I skipped the first two chromosomes and am rerunning it. I'll let you know when it finishes successfully.

Cheers, Tonu

tmargus commented 7 years ago

Hi Brandon,

Alignments: are there systematic mismatches at the 5' end of reverse strand ORFs?

IGV shows mismatches as coloured lines on the reads in the BAM alignment, but I don't see any enrichment at the ends of reads in the first part of genes on the reverse strand - unless mismatches are trimmed at the read ends, but as far as I know that is not the default behaviour of STAR.

Cheers, Tonu

bmmalone commented 7 years ago

Hi Tonu,

It can't be the real frameshift for translating ribosomes.

I agree. Maybe in a few places that could happen, but not ubiquitously throughout the genome.

Is there any undocumented command line option or line

Unfortunately, no. As far as I know, issue #42 does at least document the only places where the 5' ends are actually selected, so, in principle, it shouldn't be too difficult to add this. The bigger difficulty is probably propagating a "use-three-prime-ends" flag around the pipeline scripts. Again, not really "difficult", but it would require tests/checking in several places.

If you are up for it, one option would be for you to implement this feature and create a pull request to add it in here. I can also try to take a look, but I don't really know when it would be (not too long, but probably not today or tomorrow).

I cant imagine something will slip in from here.

It does seem unlikely; after discussing a little with others here working on ribosome profiling, they also doubt it is something "real".

I don't see any enrichment at the ends of reads in first part of genes on reverse strand.

I also had a look, and I don't see anything obvious in IGV. In some cases, most of the signal for an ORF is lost when using only the uniquely aligning reads, but there are plenty of instances where this is not the case, and the truncated version is still predicted.


I will also try to have a look at the other two points in the checklist above ("ORF type metagene profiles" and "Length-specific ORF profiles"), since there are already scripts to create these things.

Also, looking at the CIGAR strings in IGV, there seems to be a fair amount of soft-clipping going on with the alignments. We don't currently take that into account, but it is possible that is something we should do. I imagine this would just be another option to STAR, so that could also be something to consider before getting too deep into the internals.
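To put a number on the soft-clipping, a small pysam sketch could help (the bam path is a placeholder; CIGAR operation 4 is a soft clip):

import pysam

bam = pysam.AlignmentFile("/path/to/sample-unique.bam", "rb")  # placeholder

total = clipped = 0
for read in bam.fetch(until_eof=True):
    if read.is_unmapped:
        continue
    total += 1
    if read.cigartuples and any(op == 4 for op, _ in read.cigartuples):
        clipped += 1

print("{} of {} aligned reads contain soft-clipped bases".format(clipped, total))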

Have a good day, Brandon

tmargus commented 7 years ago

Hi Brandon,

I used a subset (20 million reads from the fastq) of our yeast data and ran it through both rpbp and my own pipeline; let's call it RiboSeqPy. The first difference I see is the number of reads mapped exactly once on the genome (unique). STAR reports ~10 million reads, while after the HISAT alignment there are ~5.6 million reads with the tag NH:i:1. The main part of this difference comes from a prefiltering step in my pipeline that removes lower-quality reads.

"Length specific profiles" around start (I used RiboSeqPy and read length range from 25 to 34 last line is sum) show that profiles generated from STAR alignment (left panel) look a bit ragged with sudden pikes when HISAT profiles (right panel) after first pileup around initiation are relatively smooth. Another thing what looks strange is the number of reads on plots. There is an order of magnitude less reads on plots generated from STAR alignment. Leaving difference of reads number on plot aside, what might be some another reasons, I would like to try to feed rpbp with the custom BAM made by HISAT to see whether there are some changes in strand specific bias.

[image attached: length-specific profiles, STAR (left) vs. HISAT (right)]

HISAT2's default behaviour is soft-clipping, which can be turned off with --no-softclip.

Cheers, Tonu

bmmalone commented 7 years ago

Hi Tonu,

I will have a look at the entire results a bit later, but...

prefiltering which removes reads with lower quality

You can play with the flexbar options in the config file to control this, but we haven't experimented much with it, so I don't really know how it would affect downstream analysis.

I would like to try to feed rpbp with the custom BAM

Yes, this is possible. There are docs here.

From what you described, it sounds like you could put the bam files with only unique alignments at: <riboseq_data>/without-rrna-mapping/<sample_name>[.<note>]-unique.bam, where the <field> names map to whatever is in the yaml config file. You can also remove all of the files which are generated before that in the pipeline (without-adapters, etc.). The pipeline will complain about missing files at first, but once it gets to the relevant steps (creating metagene profiles and estimating P-site offsets), it will pick up your files.
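Concretely, assuming riboseq_data: /data/yeast and a sample named wt-rep1 with no note in the config (all hypothetical values), the placement would be something like:

mkdir -p /data/yeast/without-rrna-mapping
cp wt-rep1.hisat2.unique.bam /data/yeast/without-rrna-mapping/wt-rep1-unique.bam
samtools index /data/yeast/without-rrna-mapping/wt-rep1-unique.bam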

HISAT alignment

Are the hisat commands you use based on the following?

["hisat2", "--no-unal", "-p 6", "--no-softclip", "--dta", "-x", Genome, "-U", Input, "-S", Output]

And are the STAR results based on the output of rpbp?

I would like to test STAR without allowing soft-clipping to compare, but want to make sure I have the right things first.


The prefiltering, alignment, etc., steps aren't really a "proper" part of Rp-Bp, so it would really be nice to have a feel for how these choices affect the inference in the models.

Have a good day, Brandon

tmargus commented 7 years ago

Hi Brandon,

Are the hisat commands you use based on the following?

["hisat2", "--no-unal", "-p 6", "--no-softclip", "--dta", "-x", Genome, "-U", Input, "-S", Output]

Yes, that's it. The --no-softclip option needs version 2.0.5.

And are the STAR results based on the output of rpbp?

Yes, the alignment I referred to as the STAR alignment is produced by rpbp.

Cheers, Tonu

tmargus commented 7 years ago

Hi Brandon,

Read filtering by quality plus hisat2 alignment gives almost identical results to the rpbp/STAR bam file.

rpbp with STAR gives 2563/121/114 for forward and 239/1851/836 for reverse.
The HISAT2 alignment ends up with 2495/116/95 for forward and 185/1697/867 for reverse. (The numbers are the ORF classes canonical/canonical_truncated/within.)

That doesn't explain why the metagene profiles (I posted before) are so different. One reason might be that my script didn't handle the rpbp/STAR bam file properly. I have to check.

Cheers, Tonu

bmmalone commented 7 years ago

Hi Tonu,

I haven't heard from you in a while, and I'm just wondering if you were able to track down what was going on?

Have a good day, Brandon

tmargus commented 7 years ago

Hi Brandon,

Nope. It grew over my head, and I took a break to let things settle. I started to see biases in the HEK-Gao data where you didn't, but it wasn't a clean repetition: partially because of my system limitations I used only the first 10 chromosomes, and as I already had hg38, I used it instead of hg19. Add to that that Python 3.5 has its peculiarities under OSX, and I felt I had too many unknowns in the equation.

To see whether this strand-specific bias is solely data driven, it might still be worth trying to run it on a reverse-complemented genome/annotation. I asked whether such an option is available in genometools, and they were kind enough to write a small script for reverting the annotation. It is still a bit buggy :-)

Coming back to yeast: yeast protein-coding genes almost exclusively (6,363 of 6,700 protein-coding genes) have the same leftmost/rightmost coordinates for gene, exon, and transcript. The CDS differs from these by 3 nt, as it doesn't contain the stop codon (Ensembl gff/gtf). That might be relevant in light of what was posted to "ORF extraction occasionally extracts 'wrong' ORFs" #64. But as I remember, we checked one ORF extraction and it was OK!

Cheers, Tonu

tmargus commented 7 years ago

Hi Brandon, now it's OK! There is no bias between the strands anymore. That is great!!! I added de_novo prediction to the analyses too, and it works fine.

Best, Tõnu

bmmalone commented 7 years ago

Hi Tonu,

Great! I'm glad to hear it's working now. I meant to write a bit earlier, but I left for vacation in the US.

I'm a bit surprised if there is much to find in a de novo assembly for yeast. Does it look like unannotated "UTR"s? I'm really excited to hear what you find.

Have a good day, Brandon

P.S. I'm going to close the issue because I think the orf_num stuff fixes the actual problem; however, please let us know how things are looking.

tmargus commented 7 years ago

Hi Brandon,

I haven't systematically checked where the group of ncRNA ORFs actually falls, but I expect most of them to be uORFs and, to a lesser degree, dORFs. Tuller et al. (2009) summarised yeast UTRs, giving coordinates for 4,409 and 5,127 five_prime_utr and three_prime_utr features, respectively.

If I understand correctly, to get proper ORF notation (uORF, dORF) out of rpbp, the five_prime_UTRs and three_prime_UTRs have to be part of the main annotation, but rpbp doesn't read features like five_prime_UTR. So they have to be included in the exons and transcripts? I can write a small script to extend the exon and transcript coordinates accordingly, but is that sufficient? Or is there some part of misc and/or riboutils that has stricter requirements for the GTF when converting it to BED format?

A second and separate issue for me: when I run rpbp with the whole dataset (140 million reads per replicate), the subprogram extract-orf-profiles crashes after completing 60-83% of "Finding all P-site intersections". It looks like it needs more than 32GB of RAM. This RAM consumption seems to be related to the number of reads, and maybe to using the de_novo transcript annotation. Is there some simple way to estimate how much RAM this step (or whatever the most RAM-hungry step is) needs? If it depends on the number of reads, then a workaround for me is to split a biological replicate into two fastq files and use merged predictions in the later steps.

Cheers, Tonu

tmargus commented 7 years ago

Hi Brandon,

After adding the UTR parts to the gene/transcript/exon coordinates, u/dORFs are now recognised. That's fine. The memory problem was strange; I didn't see it later. Hard to say what caused it.

Cheers, Tonu

bmmalone commented 7 years ago

Hi Tonu,

Sorry also for the slow reply here; I was on vacation for the last couple weeks and not checking in so much.

It sounds like things are worked out, but I can still clarify a few things.

So [UTRs] have to be included in exons and transcripts?

rpbp uses exon and CDS features, linked by the transcript_id attribute from the GTF files. Basically, everything that is part of an exon and not included in a CDS is treated as a UTR region. We assume the start codon is part of the CDS region, but that the stop codon is not. Ensembl uses these semantics, so that's why we adopt them. We use the "thick" BED features to indicate the CDS regions. We can then use various bed intersect operations to label the ORFs.
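As a toy illustration of those semantics (single-exon, 0-based half-open coordinates, + strand; purely illustrative):

# the exon spans the whole transcript; the CDS is the BED "thick" region
exon = (100, 400)
cds = (150, 370)  # includes the start codon, excludes the stop codon

five_prime_utr = (exon[0], cds[0])   # (100, 150)
three_prime_utr = (cds[1], exon[1])  # (370, 400), beginning with the stop codon
print(five_prime_utr, three_prime_utr)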

When a de novo assembly is available, we can further intersect it with the annotations to separate totally novel transcripts (de novo) from those which overlap some of the annotations. In case you'd like to really check out the nuts and bolts, the label-orfs script includes this logic.

Is there some simple way to estimate how much RAM in this step?

The memory usage is a bit difficult to estimate ahead of time. Internally, we use sparse matrix representations (namely, compressed sparse row format) for the profiles; thus, as the sequencing depth increases, the efficiency of our data structures degrades somewhat (though it is still typically much better than a simple dense matrix). The relationship is roughly M=3*d, where d is the number of non-zero positions in the profiles and M is the number of stored values (presumably 64 bits each). See wikipedia and the scipy site for more details.

The memory usage is also affected by the number of processes used (--num-cpus). As more processes are used, more profiles, etc., are in memory at once. The relationship then becomes something like M=3*d*p, where M and d are defined as above and p is the number of processes.

There is some non-trivial additional overhead besides that, such as the memory for the bam file, but the profiles are the main thing.
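As a back-of-the-envelope check of those relationships (all numbers illustrative):

def estimate_profile_ram_gb(nonzero_positions, num_cpus, bytes_per_value=8):
    # M = 3 * d * p stored values, assumed 8 bytes each
    stored_values = 3 * nonzero_positions * num_cpus
    return stored_values * bytes_per_value / 1024**3

# e.g., 50M non-zero profile positions across 4 processes -> ~4.5 GB
print(estimate_profile_ram_gb(50 * 10**6, 4))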

Also, we have run into some problems where python actually reports more RAM usage than this because of the way python multiprocessing handles readonly shared memory (one of the biggest disadvantages of python :-/). For example, according to top, at one point, I was using around 2000% of the ram on my laptop without swapping. This is only actually a concern if you use something like ulimit or a job management system such as slurm, torque, pbs, etc., which monitor RAM usage and kill the process if it exceeds some threshold.

Have a good day, Brandon