SchulzLab / TEPIC

Annotation of genomic regions using transcription factor binding sites and epigenetic data
MIT License
40 stars 9 forks source link

Chromosome names #12

Closed DoaneAS closed 7 years ago

DoaneAS commented 7 years ago

Hi, Thanks for developing this. I'm running TEPIC on some human ATACseq data, aligned to hg19. A few seconds after starting, there are warnings about missing chromosomes in my FASTA file. The chromosomes it can't find are chromosome (1), chromosome (2), .... chromosome (X). As all of my data is formatted for hg19, like chromosome (chr1).

The analysis fails with an index error, and the output is mostly empty. One exception is the Peak_Coverage.txt, which appears to be like a bed file with a score value for each peak, but chromosome names are changed to reseq format.

The output is like this:

WARNING. chromosome (X) was not found in the FASTA file. Skipping.   
Converting invalid characters
Starting TRAP
^[[BTraceback (most recent call last):
  File "scaleAffinity.py", line 34, in <module>
    main()
  File "scaleAffinity.py", line 18, in main
    factor=float(s[3])
IndexError: list index out of range 

thanks

Xethic commented 7 years ago

Hi,

we are sorry that you encountered problems while running our software. We will get this fixed asap.

It seems that the chromosome names in your input fasta file do not exactly match the chromosome notation in your input bed file. Background: We internally use bedtools to extract the fasta sequences of the input fasta file that fall into the intervals given by the input bed file. Thereby, of course, a matching between the chromosomes has to be established. Consequently, if the matching fails the latter input files for our downstream applications are kaputt or even empty. This can lead to such a spurious behavior.

Could you pls check and compare the chromosome notations in your input files (fasta & bed)? Alternatively, you can send us some of the first rows of both files and we will guide you through it.

Moreover, I think we should adapt our pipeline to stop when bedtools cannot establish the matching.

@Florian411 This is directly related to this question: Error In Bedtools Getfasta: Chromosome Not Found. It seems to me that bedtools creates an index, which in case of failure is not deleted and remains in a broken state. This leads to repeating failures even after the input files are corrected. Do you agree?

Best Regards,

Fabian Kern

DoaneAS commented 7 years ago

Hi, I checked my chromosome files, and my input files actually match. I think I found the bug though.

In your script called removeInvalidGenomicPositions.py, chromosome names are reformatted to a refseq format (without the "chr", so "chr1" becomes just "1"), and a new bed file is written. The problem arises because this bed file now does not match my fasta genome file.

So for example, I input a file like this:

chr1    714173  714401  X00002.100288069    0   .
chr1    762860  763179  X00003.79854    0   .
chr1    911546  911686  X00004.339451   0   .
chr1    948732  948925  X00005.9636 0   .
chr1    954753  954968  X00006.375790   0   .

and the script writes a new file:

1   714173  714401  X00002.100288069    0   .
1   762860  763179  X00003.79854    0   .
1   911546  911686  X00004.339451   0   .

But, my genome file is still in UCSC format, and this causes bedtools to fail. I changed the code in the removeInvalidGenomicPositions.py to keep the chromosome names, and this seems to solve the immediate problem, but might cause other problems...

Florian411 commented 7 years ago

Hi, Yes, this is the problem. We assume that the reference genome is in refseq format (as you say, without the "chr" prefix). I will add this to the documentation. The easiest solution for you right now would be to just use the sed command to remove the chr prefix from your reference genome. On the long term, we will change the code, such that it checks which format the reference genome is using and reformat the remaining files accordingly. Sorry for the inconvenience.