JetBrains-Research / big

BigWIG, BigBED and TDF for the JVM
MIT License
13 stars 2 forks source link

How to change chromosome names in a bigwig file? #40

Closed holgerbrandl closed 6 years ago

holgerbrandl commented 6 years ago

Is it possible to change the names of the chromosomes in a bigwig file using the API? Like to add a chr prefix.

So somehow by using BigWigFile.read and BigWigFile.write together in a streaming manner.

Obviously this is not really a ticket but more a question, and I'm happy to move over kotlin slack if desired.

iromeo commented 6 years ago

Is it possible to change the names of the chromosomes in a bigwig file using the API? Like to add a chr prefix.

You can query all wig sections from an input file, modify as you wish and write them to an output file. I can provide you a code example later today.

So somehow by using BigWigFile.read and BigWigFile.write together in a streaming manner.

Could you explain in detail? You'd like to use streaming/function API just to get better source code or it is critical for you not to load all bigwig file content into a memory?

At the moment public API doesn't provide streaming API. It could read data from selected rage/whole chromosome to List<WigSection> and write method expects Iterable<WigSection>. So you can write the source code in a functional style in steaming manner but it will not reduce memory consumption. Internally the library uses streaming API and we could add required public API if it is required.

iromeo commented 6 years ago

Code example for your task.

It requires chromosomes sizes mapping which you can download from UCSC for the genome which matches your big wig file. I guess you may use end offsets of the last wig sections for each chromosome, but I'm not sure about it.

This implementation loads all wig sections to the memory. Is it ok or you need true streaming implementation?

import org.apache.log4j.*
import org.jetbrains.bio.big.*
import java.nio.file.Files
import java.nio.file.Paths

object BigWigConverter {
    val LOG = Logger.getLogger(BigWigConverter::class.java)!!

    @JvmStatic
    fun main(args: Array<String>) {
        // E.g. download from http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.chrom.sizes
        val hg19ChromSizesPath = Paths.get("/path/hg19.chrom.sizes")
        val inputFile = "/path/in.bw"
        val outputFile = "/path/out.bw"

        // Optional logger
        Logger.getRootLogger().apply {
            addAppender(ConsoleAppender(SimpleLayout()))
            level = Level.INFO
        }

        LOG.info("Loading chromosome sizes..")
        val chromSizes = Files.readAllLines(hg19ChromSizesPath)
                .asSequence()
                .map { line ->
                    val chunks = line.split("\t", limit = 2)
                    require(chunks.size == 2)
                    chunks[0] to chunks[1].toInt()
                }.toList()
        val chromosomes = chromSizes.asSequence().map { it.first }.toSet()

        LOG.info("Opening BigFile...")
         BigWigFile.read(inputFile).use { bwInput ->
             val compression = bwInput.compression

             LOG.info("Loading Wig sections...")
             val wigSections = bwInput.chromosomes
                     .valueCollection()
                     .flatMap { chr ->
                         require("chr$chr" in chromosomes) {
                             "Chromosome length isn't specified for chromosome: $chr"
                         }
                         bwInput.query(chr)
                     }
             LOG.info("  Done")

             LOG.info("Converting Wig sections...")
             val updatedWigSections = wigSections.map {section ->
                 val fixedChrName = "chr${section.chrom}"
                 val e = when (section) {
                     is FixedStepSection -> section.copy(chrom = fixedChrName)
                     is VariableStepSection -> section.copy(chrom = fixedChrName)
                     is BedGraphSection -> section.copy(chrom = fixedChrName)
                     else -> error("Unknown section type: ${section.javaClass}")
                 }
                 e
             }
             LOG.info("  Done")

             LOG.info("Writing Wig sections...")
             BigWigFile.write(updatedWigSections, chromSizes, Paths.get(outputFile), compression = compression)
             LOG.info("  Done")
         }
    }
}
iromeo commented 6 years ago

I've tried to compare stream-based impl (using private api) with the above impl:

Memory consumption for the above impl: Used 1.4 GB, allocated 1.9 GB converter

Now steaming manner Used 1.2 GB, allocated 1.5 GB converter_streamed

The difference isn't very impressive. In stream-based impl at ~3:20-4:00 writing to file started. It requires to collect some data in order to build indexes, so memory profit isn't big.

holgerbrandl commented 6 years ago

Thanks @iromeo for your kind help. It works just great, and I have wrapped into a self-contained little tool with https://gist.github.com/holgerbrandl/3d88c11519340ce915f964d0845d4c91

hkmoon commented 6 years ago

@iromeo Could you let me know what profiler/visualization used for such a nice analysis?

iromeo commented 6 years ago

@hkmoon It is YourKit Profiler https://www.yourkit.com/java/profiler/features/. I exported memory allocation stats from there and it nicely split long graphs in several rows

olegs commented 6 years ago

@iromeo @hkmoon Starting from IntellIJ IDEA 2018.3 EAP you can use built-in profiler for this purposes, see for details: https://blog.jetbrains.com/idea/2018/09/intellij-idea-2018-3-eap-git-submodules-jvm-profiler-macos-and-linux-and-more/

iromeo commented 6 years ago

@holgerbrandl Oh, great! As a possible enhancement, you may probably omit using chrom sizes file and infer chromosome length from the input file. Implementation loads all wig sections for each chromosome, so you may use the end offset of the latest wig section as a chromosome length. @superbobry, am I right?

holgerbrandl commented 6 years ago

Since we just rewrite names this might would be indeed a nice improvement.

The harder part (which is unrelated to this thread) is that kscript currently struggles with the dynamic dependencies in big. In theory they are supported but maven somehow can not properly resolve them. See https://github.com/holgerbrandl/kscript/issues/166

hkmoon commented 6 years ago

@iromeo @olegs Thank you for the information which is really useful for my projects. It is also awesome news to know that we have a built-in profiler in IntelliJ IDEA 2018.3 EAP.