Closed michael-kotliar closed 2 years ago
On average each pileup call takes about 0.0025 seconds
count_coverage
works much faster (6 sec vs 147 sec for chr2), but it doesn't give me the option to work with the read, so I can't check its mate.
Are all reads paired alignments? I find some where only read 1 or read 2 is aligned.
Best, Nathan
On Jan 13, 2022, at 1:05 PM, pdexheimer @.***> wrote:
@pdexheimer commented on this pull request.
In src/bam_to_bed/intron_retention.py:
+# 2. Convert to BED format excluding all not intron records +# cat Hs_Ensembl_exon.txt | grep -v "gene" | awk '$2 ~ /^I/ {print $3"\t"$5"\t"$6"\t"$2"-"$1"\t"0"\t"$4}' | sort -k1,1 -k2,2n -k3,3n | bgzip > hs_ref.bed.gz +# tabix -p bed hs_ref.bed.gz +############################################################################################################################################################### + + +################################################################################################# +# DEPRECATED INSTRUCTIONS +# keep it in case we decide to support GTF input in future +# +# (grep ^"#" input.gtf; grep -v ^"#" input.gtf | sort -k1,1 -k4,4n) | bgzip > input.sorted.gtf.gz +# tabix -p gff input.sorted.gtf.gz +################################################################################################# + + +def is_paired(args): I think we should change the logic of this method. Right now, it examines 20 reads and returns False if any of them do not have a mate or the mate is unmapped. It also seeks all over the file, because mate() actually updates the iterator to point at the mate read.
I think that if a file represents paired data, all reads in the files will be paired. I think it would be sufficient to do something like this:
with pysam.AlignmentFile(args.bam, mode="rb", threads=args.threads) as bam_handler: read = next(bam_handler.fetch()) return read.is_paired() — Reply to this email directly, view it on GitHub, or unsubscribe. Triage notifications on the go with GitHub Mobile for iOS or Android. You are receiving this because you are subscribed to this thread.
Yeah, that's one of the subtleties here. AlignmentFile.mate() requires that the mate read exists and is mapped. AlignedSegment.is_paired() returns the status of the 0x0001 flag in the BAM's FLAG field, which is simply "Were two reads generated by the sequencer?"
There's also AlignedSegment.is_proper_pair, which returns the 0x0002 flag. Unfortunately, the criteria for calling something a "proper" pair is completely up to the aligner.
Another side note, it's actually not clear to me from the documentation whether is_paired/is_proper_paired are methods or properties, so my syntax might be a little off
How to prepare an intron reference file
To obtain a properly formatted coordinate sorted indexed gene model reference BED file that includes only introns, do the following actions with your AltAnalyze generated
Hs_Ensembl_exon.txt
(for example) file.Hs_Ensembl_exon.txt
fileHow to run the program
optional arguments: -h, --help show this help message and exit --bam BAM Path to the coordinate-sorted indexed BAM file --ref REF Path to the coordinate-sorted indexed gene model reference BED file --span SPAN 5' and 3' overlap that read should have over a splice-site to be counted --strandness {auto,forward,reverse,unstranded} Strand specificty of the RNA library.'unstranded' - reads from the left-most end of the fragment (in transcript coordinates) map to the transcript strand, and the right-most end maps to the opposite strand. 'forward' - same as 'unstranded' except we enforce the rule that the left-most end of the fragment (in transcript coordinates) is the first sequenced (or only sequenced for single-end reads). Equivalently, it is assumed that only the strand generated during second strand synthesis is sequenced. Used for Ligation and Standard SOLiD. 'reverse' - same as 'unstranded' except we enforce the rule that the right- most end of the fragment (in transcript coordinates) is the first sequenced (or only sequenced for single-end reads). Equivalently, it is assumed that only the strand generated during first strand synthesis is sequenced. Used for dUTP, NSR, and NNSR. Default: first 'auto' (try to detect strand from the XS tag of the read), then downgrade to 'unstranded' --threads THREADS Number of threads to decompress BAM file --cpus CPUS Number of processes to run in parallel --chr [CHR [CHR ...]] Select chromosomes to process. Default: all available --loglevel {fatal,error,warning,info,debug} Logging level. Default: info --savereads Export processed reads into the BAM file. Default: False --output OUTPUT Output file prefix