Make it possible to enter fa/fai file on command line instead of genome

biocore-ntnu / epic

(DEPRECATED) epic: diffuse domain ChIP-Seq caller based on SICER

http://bioepic.readthedocs.io

MIT License

31 stars 6 forks source link

Make it possible to enter fa/fai file on command line instead of genome #2

Closed mictadlo closed 8 years ago

mictadlo commented 8 years ago

Hello, instead of hard coding chromosome name and lengths. Could you for example use FASTA index created by samtools:

def addGenomeData(input_filename):
    genomeData = {}

    with open(input_filename) as i:
        for line in i:
            try:
                parts = line.rstrip().split('\t')
            except IndexError:
                continue
            genomeData[parts[0]] = parts[1]

    print genomeData
    print genomeData.keys()

if __name__ == "__main__":

    addGenomeData("/data//Bactrocera_tryoni/Bac.fasta.fai")

or to use http://nullege.com/codes/search/pysam.faidx so the user only has to provide FASTA file and pysam will create the FASTA index.

If Epic could be used without hard coding genome sizes than it will be possible to use it in galaxy (use galaxy.org).

Thank you in advance.

Mic

endrebak commented 8 years ago

Will seriously consider your suggestion. I have been looking for a better way to include genome info. Thanks!

endrebak commented 8 years ago

I think I will use pyfaidx to create those indexes since it is 100% Python, but otherwise I like your idea. Almost everyone has a fasta of the genome they use.

I think I will keep a hash of genomes for 1) speed gains and simplicity and 2) in case someone does not have the fasta.

Without the built-in hash people would have to

wait for an index to be created if they have no index, only the fasta
enter a path on the command line for every invocation of epic
have a copy of the fasta themselves.

Btw, this is what pyfaidx created indexes look like:

endrebak@havpryd ~/genomes> head hg19.fa.fai
chr1    249250621   6   50  51
chr2    243199373   254235646   50  51
chr3    198022430   502299013   50  51
chr4    191154276   704281898   50  51
chr5    180915260   899259266   50  51
chr6    171115067   1083792838  50  51
chr7    159138663   1258330213  50  51
chr8    146364022   1420651656  50  51
chr9    141213431   1569942965  50  51
chr10   135534747   1713980672  50  51

endrebak commented 8 years ago

@mictadlo

I came to think of one problem: you still do not know the effective genome size of your genome and that is not something that can be easily gleaned from a fasta file.

I can automatically compute it, but then jellyfish would be a requirement for epic to work. Would having a requirement on another tool (jellyfish2) be bad, galaxy-wise?

You probably do not want to recompute the egs each time so I guess I would have to add a command line argument for that.

mictadlo commented 8 years ago

Hi @endrebak Looking for example at BWA in https://toolshed.g2.bx.psu.edu it depends as well on SAMTOOLS. While BWA is running it pipes the output to SAMTOOLS in order to convert it BAM format.

There is already a package for jellyfish in the Galaxy toolshed, but the version seems to be a little old.

It is not necessary to add a command line argument for jellyfish. In your code you assume that jellyfish is in the path. In the wrapper for galaxy it is possible to make a dependency to a jellyfish package.

endrebak commented 8 years ago

I consider this fixed with #3

Computing the egs from a fasta each time the software is run is prohibitively expensive (but if it were only the chromsizes we needed from the fasta, your idea would have worked).