Closed aleighbrown closed 2 years ago
Thanks! does this actually run now?
Nope..spoke too soon on that...
intersect isn't working as advertised. To be fair I'm not using the arabidopsis file but a subsampled version of my own data(mouse), and just checking coverage at Gapdh
samtools view /data/24wt_sub.bam chr6:125159852-125168467 | wc -l
6221
but with the as described method of using intersect
chrom = "chr6" leftpos = 125159852 rightpos = 125168467
intersect(
count(_ -> true, intersect(reader, chrom, leftpos:rightpos)) 0
Just in case it was something silly I've also checked chrom = "6", chrom = 6, chrom = "Chr6", chrom = '6'
Hmm - does the test arabidopsis data give the expected result? TBH, I don't deal with this type of data myself, so I won't be much help running it down. @ciaranomara has all the commits on XAM.jl - can they help?
I'll check whether I'm able to reproduce the results of the Arabidopsis thaliana data, then we may delve in from there.
The tutorial is certainly out of date. @aleighbrown, I think you'd want to use the eachoverlap
function from the BioAlignments
v1 package (XAM isn't ready).
I've included my rough workflow below.
# Download Reference genome: Arabidopsis thaliana (assembly TAIR10.1) https://www.ncbi.nlm.nih.gov/genome/?term=txid3702[orgn].
path_genome = download("ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/735/GCF_000001735.4_TAIR10.1/GCF_000001735.4_TAIR10.1_genomic.fna.gz", joinpath(@__DIR__, "GCF_000001735.4_TAIR10.1_genomic.fna.gz"))
# Unzip genome.
run(`gunzip $path_genome`)
# Download HISAT2 binary (http://ccb.jhu.edu/software/hisat2/index.shtml).
# path_hisat = download("ftp://ftp.ccb.jhu.edu/pub/infphilo/hisat2/downloads/hisat2-2.1.0-Linux_x86_64.zip")
path_hisat = download("ftp://ftp.ccb.jhu.edu/pub/infphilo/hisat2/downloads/hisat2-2.1.0-OSX_x86_64.zip")
# path_hisat = download("http://www.di.fc.ul.pt/~afalcao/hisat2.1/hisat2.1_Windows.zip")
# Unzip HISAT2.
run(`unzip $path_hisat -d $(@__DIR__)`)
# Build index.
run(`hisat2-2.1.0/hisat2-build GCF_000001735.4_TAIR10.1_genomic.fna GCF_000001735.4_TAIR10.1_genomic.fna`)
# Download SRR1238088.
run(`fasterq-dump SRR1238088`)
# Align.
run(`hisat2-2.1.0/hisat2 -x GCF_000001735.4_TAIR10.1_genomic.fna -1 SRR1238088_1.fastq -2 ./SRR1238088_2.fastq -S SRR1238088.sam`)
# Convert SAM to BAM.
run(`samtools view -Sb SRR1238088.sam -o SRR1238088.bam`)
# Sort BAM.
run(`samtools sort SRR1238088.bam -o SRR1238088.sort.bam`)
# Index BAM.
run(`samtools index SRR1238088.sort.bam`)
# The important bit.
using BioAlignments #v1.0.0
# AT3G11820
chrom = "NC_003074.8" # Chr3 (see https://www.ncbi.nlm.nih.gov/genome/?term=txid3702[orgn])
leftpos = 3729305
rightpos = 3731116
bamfile = "SRR1238088.sort.bam"
open(BAM.Reader, bamfile, index=bamfile * ".bai") do reader
count(_ -> true, eachoverlap(reader, chrom, leftpos:rightpos))
end
read(pipeline(`samtools view SRR1238088.sort.bam $chrom:$leftpos-$rightpos`, `wc -l`), String) |> strip
In terms of the "Chr6" stuff, you can get the names from the reader.
open(BAM.Reader, bamfile, index=bamfile * ".bai") do reader
reader.refseqnames
end
Closed because I changed the default branch to main
from master
, but would be great to get this updated.
Probably now could use XAM.jl, but I would prefer to skip using other deps (like hisat
) if possible, but if we can't, could use @CiaranOMara 's idea above
just changed a period to make it function properly. Just putting it out here to give it a nudge for updates.