BioJulia / BioTutorials

Tutorial Notebooks of BioJulia
29 stars 7 forks source link

Update RNA-Seq Coverage.ipynb #2

Closed aleighbrown closed 2 years ago

aleighbrown commented 4 years ago

just changed a period to make it function properly. Just putting it out here to give it a nudge for updates.

kescobo commented 4 years ago

Thanks! does this actually run now?

aleighbrown commented 4 years ago

Nope..spoke too soon on that...

intersect isn't working as advertised. To be fair I'm not using the arabidopsis file but a subsampled version of my own data(mouse), and just checking coverage at Gapdh samtools view /data/24wt_sub.bam chr6:125159852-125168467 | wc -l

6221

but with the as described method of using intersect

chrom = "chr6" leftpos = 125159852 rightpos = 125168467 intersect(, , ) returns an iterator of BAM records that overlaps the specified genomic range. Let's check that the number of returned records matches that of samtools view:

count(_ -> true, intersect(reader, chrom, leftpos:rightpos)) 0

Just in case it was something silly I've also checked chrom = "6", chrom = 6, chrom = "Chr6", chrom = '6'

kescobo commented 4 years ago

Hmm - does the test arabidopsis data give the expected result? TBH, I don't deal with this type of data myself, so I won't be much help running it down. @ciaranomara has all the commits on XAM.jl - can they help?

CiaranOMara commented 4 years ago

I'll check whether I'm able to reproduce the results of the Arabidopsis thaliana data, then we may delve in from there.

CiaranOMara commented 4 years ago

The tutorial is certainly out of date. @aleighbrown, I think you'd want to use the eachoverlap function from the BioAlignments v1 package (XAM isn't ready).

I've included my rough workflow below.

# Download Reference genome: Arabidopsis thaliana (assembly TAIR10.1) https://www.ncbi.nlm.nih.gov/genome/?term=txid3702[orgn].
path_genome = download("ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/735/GCF_000001735.4_TAIR10.1/GCF_000001735.4_TAIR10.1_genomic.fna.gz", joinpath(@__DIR__, "GCF_000001735.4_TAIR10.1_genomic.fna.gz"))

# Unzip genome.
run(`gunzip $path_genome`)

# Download HISAT2 binary (http://ccb.jhu.edu/software/hisat2/index.shtml).
# path_hisat = download("ftp://ftp.ccb.jhu.edu/pub/infphilo/hisat2/downloads/hisat2-2.1.0-Linux_x86_64.zip")
path_hisat = download("ftp://ftp.ccb.jhu.edu/pub/infphilo/hisat2/downloads/hisat2-2.1.0-OSX_x86_64.zip")
# path_hisat = download("http://www.di.fc.ul.pt/~afalcao/hisat2.1/hisat2.1_Windows.zip")

# Unzip HISAT2.
run(`unzip $path_hisat -d $(@__DIR__)`)

# Build index.
run(`hisat2-2.1.0/hisat2-build GCF_000001735.4_TAIR10.1_genomic.fna GCF_000001735.4_TAIR10.1_genomic.fna`)

# Download SRR1238088.
run(`fasterq-dump SRR1238088`)

# Align.
run(`hisat2-2.1.0/hisat2 -x GCF_000001735.4_TAIR10.1_genomic.fna -1 SRR1238088_1.fastq -2 ./SRR1238088_2.fastq -S SRR1238088.sam`)

# Convert SAM to BAM.
run(`samtools view -Sb SRR1238088.sam -o SRR1238088.bam`)

# Sort BAM.
run(`samtools sort SRR1238088.bam -o SRR1238088.sort.bam`)

# Index BAM.
run(`samtools index SRR1238088.sort.bam`)

# The important bit.
using BioAlignments #v1.0.0

# AT3G11820
chrom = "NC_003074.8" # Chr3 (see  https://www.ncbi.nlm.nih.gov/genome/?term=txid3702[orgn])
leftpos = 3729305
rightpos = 3731116

bamfile = "SRR1238088.sort.bam"

open(BAM.Reader, bamfile, index=bamfile * ".bai") do reader
    count(_ -> true, eachoverlap(reader, chrom, leftpos:rightpos))
end

read(pipeline(`samtools view SRR1238088.sort.bam $chrom:$leftpos-$rightpos`, `wc -l`), String) |> strip

In terms of the "Chr6" stuff, you can get the names from the reader.

open(BAM.Reader, bamfile, index=bamfile * ".bai") do reader
    reader.refseqnames
end
kescobo commented 2 years ago

Closed because I changed the default branch to main from master, but would be great to get this updated.

Probably now could use XAM.jl, but I would prefer to skip using other deps (like hisat) if possible, but if we can't, could use @CiaranOMara 's idea above