SalomonisLab / altanalyze3

Apache License 2.0
2 stars 3 forks source link

Add script for intron retention (work in progress) #6

Closed michael-kotliar closed 2 years ago

michael-kotliar commented 2 years ago

How to prepare an intron reference file

To obtain a properly formatted coordinate sorted indexed gene model reference BED file that includes only introns, do the following actions with your AltAnalyze generated Hs_Ensembl_exon.txt (for example) file.

  1. Make sure you have these columns in you input Hs_Ensembl_exon.txt file
    head -n 1 Hs_Ensembl_exon.txt
    # Expected output
    # gene exon-id chromosome strand exon-region-start(s) exon-region-stop(s) constitutive_call ens_exon_ids splice_events splice_junctions
  2. Convert to BED format excluding all not intron records
    cat Hs_Ensembl_exon.txt | grep -v "gene" | awk '$2 ~ /^I/ {print $3"\t"$5"\t"$6"\t"$2"-"$1"\t"0"\t"$4}' | sort -k1,1 -k2,2n -k3,3n | bgzip > hs_ref.bed.gz
    tabix -p bed hs_ref.bed.gz

    How to run the program

    
    usage: intron_retention.py [-h] --bam BAM --ref REF [--span SPAN] [--strandness {auto,forward,reverse,unstranded}] [--threads THREADS] [--cpus CPUS] [--chr [CHR [CHR ...]]]
                           [--loglevel {fatal,error,warning,info,debug}] [--savereads] [--output OUTPUT]

optional arguments: -h, --help show this help message and exit --bam BAM Path to the coordinate-sorted indexed BAM file --ref REF Path to the coordinate-sorted indexed gene model reference BED file --span SPAN 5' and 3' overlap that read should have over a splice-site to be counted --strandness {auto,forward,reverse,unstranded} Strand specificty of the RNA library.'unstranded' - reads from the left-most end of the fragment (in transcript coordinates) map to the transcript strand, and the right-most end maps to the opposite strand. 'forward' - same as 'unstranded' except we enforce the rule that the left-most end of the fragment (in transcript coordinates) is the first sequenced (or only sequenced for single-end reads). Equivalently, it is assumed that only the strand generated during second strand synthesis is sequenced. Used for Ligation and Standard SOLiD. 'reverse' - same as 'unstranded' except we enforce the rule that the right- most end of the fragment (in transcript coordinates) is the first sequenced (or only sequenced for single-end reads). Equivalently, it is assumed that only the strand generated during first strand synthesis is sequenced. Used for dUTP, NSR, and NNSR. Default: first 'auto' (try to detect strand from the XS tag of the read), then downgrade to 'unstranded' --threads THREADS Number of threads to decompress BAM file --cpus CPUS Number of processes to run in parallel --chr [CHR [CHR ...]] Select chromosomes to process. Default: all available --loglevel {fatal,error,warning,info,debug} Logging level. Default: info --savereads Export processed reads into the BAM file. Default: False --output OUTPUT Output file prefix


## How to debug
- To select only specific chromosomes use `--chr` parameter.
- To run program in verbose mode use `--loglevel debug`
- To save all used reads into BAM file use `--savereads`. **Note, this slows down all computations.** Resulted BAM file can be opened in IGV and colored by `XI` flag
michael-kotliar commented 2 years ago

On average each pileup call takes about 0.0025 seconds count_coverage works much faster (6 sec vs 147 sec for chr2), but it doesn't give me the option to work with the read, so I can't check its mate.

nsalomonis commented 2 years ago

Are all reads paired alignments? I find some where only read 1 or read 2 is aligned.

Best, Nathan

On Jan 13, 2022, at 1:05 PM, pdexheimer @.***> wrote:

 @pdexheimer commented on this pull request.

In src/bam_to_bed/intron_retention.py:

+# 2. Convert to BED format excluding all not intron records +# cat Hs_Ensembl_exon.txt | grep -v "gene" | awk '$2 ~ /^I/ {print $3"\t"$5"\t"$6"\t"$2"-"$1"\t"0"\t"$4}' | sort -k1,1 -k2,2n -k3,3n | bgzip > hs_ref.bed.gz +# tabix -p bed hs_ref.bed.gz +############################################################################################################################################################### + + +################################################################################################# +# DEPRECATED INSTRUCTIONS +# keep it in case we decide to support GTF input in future +# +# (grep ^"#" input.gtf; grep -v ^"#" input.gtf | sort -k1,1 -k4,4n) | bgzip > input.sorted.gtf.gz +# tabix -p gff input.sorted.gtf.gz +################################################################################################# + + +def is_paired(args): I think we should change the logic of this method. Right now, it examines 20 reads and returns False if any of them do not have a mate or the mate is unmapped. It also seeks all over the file, because mate() actually updates the iterator to point at the mate read.

I think that if a file represents paired data, all reads in the files will be paired. I think it would be sufficient to do something like this:

with pysam.AlignmentFile(args.bam, mode="rb", threads=args.threads) as bam_handler: read = next(bam_handler.fetch()) return read.is_paired() — Reply to this email directly, view it on GitHub, or unsubscribe. Triage notifications on the go with GitHub Mobile for iOS or Android. You are receiving this because you are subscribed to this thread.

pdexheimer commented 2 years ago

Yeah, that's one of the subtleties here. AlignmentFile.mate() requires that the mate read exists and is mapped. AlignedSegment.is_paired() returns the status of the 0x0001 flag in the BAM's FLAG field, which is simply "Were two reads generated by the sequencer?"

There's also AlignedSegment.is_proper_pair, which returns the 0x0002 flag. Unfortunately, the criteria for calling something a "proper" pair is completely up to the aligner.

Another side note, it's actually not clear to me from the documentation whether is_paired/is_proper_paired are methods or properties, so my syntax might be a little off

michael-kotliar commented 2 years ago