BIMSBbioinfo / scregseg

Single-cell regulatory landscape segmentation
GNU General Public License v3.0
Fix mixed data type issues caused by memory efficient loading by pand… #10

Open prauten opened 2 months ago

prauten commented 2 months ago

…as when using Ensembl annotations (chromosome naming conventions).

Hi Wolfgang,

This pull request addresses issues I ran into when working with ATAC fragment files that follow Ensembl chromosome naming ("1" or 1 instead of "chr1", and so on) and that contain a header. The memory-efficient loading strategy in pandas (pd.read_csv, which is also called internally by BedTool()) led to mixed data types, e.g., chromosome 1 being represented as both "1" and 1, which caused inconsistent values for some chromosomes. In addition, if a header is present, start and end were read as floats instead of the integers required elsewhere in the code.
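To illustrate the general idea, here is a minimal sketch of reading a fragment file with explicit column types, so that pandas does not have to infer them chunk by chunk. It is not the code from this pull request: the helper name `read_fragments`, the column names in `COLUMNS`, and the `has_header` flag are hypothetical and chosen only for this example.

```python
import io

import pandas as pd

# Hypothetical column layout for a 5-column ATAC fragment file.
COLUMNS = ["chrom", "start", "end", "barcode", "count"]


def read_fragments(source, has_header=False):
    """Read a tab-separated fragment file with fixed column dtypes.

    Declaring dtypes up front keeps chromosome names as strings
    ("1" stays "1", never the integer 1) and keeps start/end as integers.
    """
    return pd.read_csv(
        source,
        sep="\t",
        # header=0 together with names= replaces an existing header row;
        # header=None treats the first line as data.
        header=0 if has_header else None,
        names=COLUMNS,
        dtype={
            "chrom": str,
            "start": "int64",
            "end": "int64",
            "barcode": str,
            "count": "int64",
        },
    )


# Ensembl-style chromosome names, without and with a header line.
body = "1\t100\t200\tAAAC\t1\nX\t300\t450\tAAAG\t2\n"
plain = read_fragments(io.StringIO(body))
with_header = read_fragments(
    io.StringIO("chrom\tstart\tend\tbarcode\tcount\n" + body),
    has_header=True,
)
```

In this sketch both calls yield the same table, with `chrom` holding strings and `start`/`end` holding integers, regardless of whether the file carries a header.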

The implemented changes should be backward compatible.

Let me know if you have any questions.

Best, Pia