Closed holgerbrandl closed 6 years ago
Is it possible to change the names of the chromosomes in a bigwig file using the API? Like to add a chr prefix.
You can query all wig sections from an input file, modify as you wish and write them to an output file. I can provide you a code example later today.
So somehow by using BigWigFile.read and BigWigFile.write together in a streaming manner.
Could you explain in detail? You'd like to use streaming/function API just to get better source code or it is critical for you not to load all bigwig file content into a memory?
At the moment public API doesn't provide streaming API. It could read data from selected rage/whole chromosome to List<WigSection>
and write
method expects Iterable<WigSection>
. So you can write the source code in a functional style in steaming manner but it will not reduce memory consumption. Internally the library uses streaming API and we could add required public API if it is required.
Code example for your task.
It requires chromosomes sizes mapping which you can download from UCSC for the genome which matches your big wig file. I guess you may use end offsets of the last wig sections for each chromosome, but I'm not sure about it.
This implementation loads all wig sections to the memory. Is it ok or you need true streaming implementation?
import org.apache.log4j.*
import org.jetbrains.bio.big.*
import java.nio.file.Files
import java.nio.file.Paths
object BigWigConverter {
val LOG = Logger.getLogger(BigWigConverter::class.java)!!
@JvmStatic
fun main(args: Array<String>) {
// E.g. download from http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.chrom.sizes
val hg19ChromSizesPath = Paths.get("/path/hg19.chrom.sizes")
val inputFile = "/path/in.bw"
val outputFile = "/path/out.bw"
// Optional logger
Logger.getRootLogger().apply {
addAppender(ConsoleAppender(SimpleLayout()))
level = Level.INFO
}
LOG.info("Loading chromosome sizes..")
val chromSizes = Files.readAllLines(hg19ChromSizesPath)
.asSequence()
.map { line ->
val chunks = line.split("\t", limit = 2)
require(chunks.size == 2)
chunks[0] to chunks[1].toInt()
}.toList()
val chromosomes = chromSizes.asSequence().map { it.first }.toSet()
LOG.info("Opening BigFile...")
BigWigFile.read(inputFile).use { bwInput ->
val compression = bwInput.compression
LOG.info("Loading Wig sections...")
val wigSections = bwInput.chromosomes
.valueCollection()
.flatMap { chr ->
require("chr$chr" in chromosomes) {
"Chromosome length isn't specified for chromosome: $chr"
}
bwInput.query(chr)
}
LOG.info(" Done")
LOG.info("Converting Wig sections...")
val updatedWigSections = wigSections.map {section ->
val fixedChrName = "chr${section.chrom}"
val e = when (section) {
is FixedStepSection -> section.copy(chrom = fixedChrName)
is VariableStepSection -> section.copy(chrom = fixedChrName)
is BedGraphSection -> section.copy(chrom = fixedChrName)
else -> error("Unknown section type: ${section.javaClass}")
}
e
}
LOG.info(" Done")
LOG.info("Writing Wig sections...")
BigWigFile.write(updatedWigSections, chromSizes, Paths.get(outputFile), compression = compression)
LOG.info(" Done")
}
}
}
I've tried to compare stream-based impl (using private api) with the above impl:
Memory consumption for the above impl: Used 1.4 GB, allocated 1.9 GB
Now steaming manner Used 1.2 GB, allocated 1.5 GB
The difference isn't very impressive. In stream-based impl at ~3:20-4:00 writing to file started. It requires to collect some data in order to build indexes, so memory profit isn't big.
Thanks @iromeo for your kind help. It works just great, and I have wrapped into a self-contained little tool with https://gist.github.com/holgerbrandl/3d88c11519340ce915f964d0845d4c91
@iromeo Could you let me know what profiler/visualization used for such a nice analysis?
@hkmoon It is YourKit Profiler https://www.yourkit.com/java/profiler/features/. I exported memory allocation stats from there and it nicely split long graphs in several rows
@iromeo @hkmoon Starting from IntellIJ IDEA 2018.3 EAP you can use built-in profiler for this purposes, see for details: https://blog.jetbrains.com/idea/2018/09/intellij-idea-2018-3-eap-git-submodules-jvm-profiler-macos-and-linux-and-more/
@holgerbrandl Oh, great! As a possible enhancement, you may probably omit using chrom sizes
file and infer chromosome length from the input file. Implementation loads all wig sections for each chromosome, so you may use the end offset of the latest wig section as a chromosome length. @superbobry, am I right?
Since we just rewrite names this might would be indeed a nice improvement.
The harder part (which is unrelated to this thread) is that kscript
currently struggles with the dynamic dependencies in big
. In theory they are supported but maven somehow can not properly resolve them. See https://github.com/holgerbrandl/kscript/issues/166
@iromeo @olegs Thank you for the information which is really useful for my projects. It is also awesome news to know that we have a built-in profiler in IntelliJ IDEA 2018.3 EAP.
Is it possible to change the names of the chromosomes in a bigwig file using the API? Like to add a
chr
prefix.So somehow by using
BigWigFile.read
andBigWigFile.write
together in a streaming manner.Obviously this is not really a ticket but more a question, and I'm happy to move over kotlin slack if desired.