dieterich-lab / riboseq-utils

This project contains utilities common to the rp-bp and (soon-to-be pubic) translational efficiency projects. It is not intended to be a standalone project.
MIT License
0 stars 0 forks source link

Measuring translation efficiency with b-tea #8

Open eboileau opened 6 years ago

eboileau commented 6 years ago

simulate_data.txt

TE can be estimated using a simulation approach based on modifying real data to introduce differences in RNA/RFP data. The attached ad hoc script uses samtools to subsample a dataset, except for a proportion of regions/genes, that "should" show up as globally differentially translated. This may however significantly reduce the size of the dataset! The number of times a subset is downsampled can be adjusted, this should yield more uniform downsampling.

CDieterich commented 6 years ago

Ok, just to make sure that everyone is on the same page

1) We need to select a data set to work with @bmmalone - what do you think ? Should we just try some HEK data or start with the neurite / soma stuff.

2) We need to understand which BAM files should be used as an input. I personally think that those should be the raw ones. However e.g. for cell-type-neural.rep-1.sra-neural.GRCm38_87*bam there are 17 ! different files @bmmalone - please advise which one to use.

3) @eboileau - do you feel more comfortable with python or bash ? There is a python binding to samtools called pysam (different to pybedtools) http://pysam.readthedocs.io/en/latest/api.html

CDieterich commented 6 years ago

for 2) I suppose it is /prj/b-tea/analysis/RPF/without-rrna-mapping/soma.riboseq.cell-type-neural.rep-1.sra-neural.GRCm38_87.tied-variance.mergedAligned.sortedByCoord.out.bam

Correct ?

eboileau commented 6 years ago
  1. These files seem to have different headers... I'll have to filter out some of the data and index them, as they were not indexed. Are they references to transcripts instead of genes? Some reference names are numbers only...?
  2. For now bash is fine, I will look at pysam. Thanks.
eboileau commented 6 years ago

@bmmalone - ok, I'm getting clearer about the data, these files (.tied-variance.mergedAligned.sortedByCoord.out.) would not work straight away as they are chromosome mappings, correct? Downsampling these reads would affect many genes... The file I picked initially had gene mappings, as far as I understood, so this still made sense then. I'm trying to find out which file you need ultimately for your analysis, please advise which file I should use?! Thanks.

bmmalone commented 6 years ago

Hi all,

Going back to the questions from @CDieterich ...

  1. I advocate directly working with the neurite/soma data since we have the proteomics data. The data in /prj/punch-p-riboseq was also fine, though. Both should have rpbp reports, but just let me know if something is missing.

  2. Regarding the files in general, rather than directly answering the specific questions, I hope general documentation (which includes the relevant files) helps better. So please checkout the dev docs for b-tea for a complete documentation of the files created by the pipelines. If something is missing, confusing, or seems wrong, please just let me know... here, there, or wherever. It eventually needs to be correct anyway.

    a. Specifically, for this project, ".tied-variance" is part of the \<note\> field in the config file, so that is where it comes from in the filenames; it is not something internal. It has to do with a specific entry in the config file.

    At that point in the pipeline, the note "tied-variance" part of the filename does not matter.

  3. If you are using python, I highly recommend pysam. Unlike pybedtools*, it is a very nice wrapper around the samtools c interface.

Have a good day, Brandon

eboileau commented 6 years ago

Brandon, thanks for your reply.

  1. Yes, I've started to read the docs for b-tea... as I wasn't sure about genome vs. transcriptome alignment files...
eboileau commented 6 years ago

@Brandon, ok... the RFP files are not indexed... when I try to sort and index them (I need to have the files indexed), there is some bug which I have not figured out yet... Sorting seems ok, or at least completes without error, with default option and by read name (-n), but indexing fails with the following errors for default and by read name, respectively: [E::hts_idx_push] Unsorted positions on sequence #1972: 1 followed by 0 and [E::hts_idx_push] Chromosome blocks not continuous

I'm not sure what is happening when I look at the file... The corresponding RNA files (without the length/offset) are indexed already, so the problem is for RFP only. Is there something in the way those files are generated that could point to the problem?

bmmalone commented 6 years ago

I identified the source of the problem: some of the reads are ending up with negative coordinates in transcript space, and samtools sorts them to the end of the file. I am now tracking down why this happens.

P.S. The "default" sort should be used, like this: samtools sort my.bam -@10 -o sorted.bam

bmmalone commented 6 years ago

This happens when we have opposite-strand alignments for the transcript in question. For example:

(snippet of a bam file columns: read name, alignment direction, transcript name, transcript position)

ERR1551313.24098878     0       ENSMUSG00000005510.merged       19
ERR1551313.1601355      16      ENSMUSG00000005510.merged       -8

The second alignment is sorted to the bottom of the bam file, and this later causes the

[E::hts_idx_push] Chromosome blocks not continuous

error when indexing.

bmmalone commented 6 years ago

As far as I can tell, there is not a way to tell STAR to discard reverse-strand, transcriptomic alignments.

A work-around is given in Issue #9. Essentially, you can use this code:

# remove reverse-strand alignments
samtools view -F 16 -b transcriptome.sorted.bam > transcriptome.sorted.forward-only.bam 

# create the index as normal
samtools index -b transcriptome.sorted.forward-only.bam transcriptome.sorted.forward-only.bam.bai