Closed DoaneAS closed 7 years ago
Hi,
we are sorry that you encountered problems while running our software. We will get this fixed asap.
It seems that the chromosome names in your input fasta file do not exactly match the chromosome notation in your input bed file. Background: We internally use bedtools to extract the fasta sequences of the input fasta file that fall into the intervals given by the input bed file. Thereby, of course, a matching between the chromosomes has to be established. Consequently, if the matching fails the latter input files for our downstream applications are kaputt or even empty. This can lead to such a spurious behavior.
Could you pls check and compare the chromosome notations in your input files (fasta & bed)? Alternatively, you can send us some of the first rows of both files and we will guide you through it.
Moreover, I think we should adapt our pipeline to stop when bedtools cannot establish the matching.
@Florian411 This is directly related to this question: Error In Bedtools Getfasta: Chromosome Not Found. It seems to me that bedtools creates an index, which in case of failure is not deleted and remains in a broken state. This leads to repeating failures even after the input files are corrected. Do you agree?
Best Regards,
Fabian Kern
Hi, I checked my chromosome files, and my input files actually match. I think I found the bug though.
In your script called removeInvalidGenomicPositions.py, chromosome names are reformatted to a refseq format (without the "chr", so "chr1" becomes just "1"), and a new bed file is written. The problem arises because this bed file now does not match my fasta genome file.
So for example, I input a file like this:
chr1 714173 714401 X00002.100288069 0 .
chr1 762860 763179 X00003.79854 0 .
chr1 911546 911686 X00004.339451 0 .
chr1 948732 948925 X00005.9636 0 .
chr1 954753 954968 X00006.375790 0 .
and the script writes a new file:
1 714173 714401 X00002.100288069 0 .
1 762860 763179 X00003.79854 0 .
1 911546 911686 X00004.339451 0 .
But, my genome file is still in UCSC format, and this causes bedtools to fail. I changed the code in the removeInvalidGenomicPositions.py to keep the chromosome names, and this seems to solve the immediate problem, but might cause other problems...
Hi, Yes, this is the problem. We assume that the reference genome is in refseq format (as you say, without the "chr" prefix). I will add this to the documentation. The easiest solution for you right now would be to just use the sed command to remove the chr prefix from your reference genome. On the long term, we will change the code, such that it checks which format the reference genome is using and reformat the remaining files accordingly. Sorry for the inconvenience.
Hi, Thanks for developing this. I'm running TEPIC on some human ATACseq data, aligned to hg19. A few seconds after starting, there are warnings about missing chromosomes in my FASTA file. The chromosomes it can't find are chromosome (1), chromosome (2), .... chromosome (X). As all of my data is formatted for hg19, like chromosome (chr1).
The analysis fails with an index error, and the output is mostly empty. One exception is the Peak_Coverage.txt, which appears to be like a bed file with a score value for each peak, but chromosome names are changed to reseq format.
The output is like this:
thanks