GangCaoLab / CoolBox

Jupyter notebook based genomic data visualization toolkit.
https://gangcaolab.github.io/CoolBox/index.html
GNU General Public License v3.0

Bug in GTF handling [patch offered] #90

Closed by ggilestro 4 months ago

ggilestro commented 9 months ago

Hi, the code below is problematic for two reasons:

https://github.com/GangCaoLab/CoolBox/blob/36a86b20e032c6200d6f4077a5b241c0dbda2a78/coolbox/core/track/gtf.py#L109-L123

1) It does not check for NaN values when name_attr is not set to auto. Any NaN is then passed as a label to DnaFeaturesViewer, and the code crashes because it tries to split a float.
2) The regex will not work if name_attr is the last attribute in the list.

The code can be fixed by moving the NaN sanity check outside of the if..else and adjusting the regular expression pattern, in the following way:

        name_attr = self.properties.get("name_attr", "auto")
        if name_attr == "auto":
            gene_name = df['attribute'].str.extract(".*gene_name (.*?) ").iloc[:, 0].str.strip('\";')
            if gene_name.hasnans:
                gene_id = df['attribute'].str.extract(".*gene_id (.*?) ").iloc[:, 0].str.strip('\";')
                gene_name.fillna(gene_id, inplace=True)
        else:
            gene_name = df['attribute'].str.extract(f".*{name_attr} (.*?)(?:[ ;])").iloc[:, 0].str.strip('\";')

        if gene_name.hasnans:
            pos_str = df['seqname'].astype(str) + ":" +\
                      df['start'].astype(str) + "-" +\
                      df['end'].astype(str)
            gene_name.fillna(pos_str, inplace=True)

        df['feature_name'] = gene_name
        return df
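
To illustrate the second point, here is a minimal standalone sketch that uses only Python's `re` module rather than pandas. The attribute strings and the `extract_attr` helper are made up for illustration, and the "old" pattern is an assumption modelled on the `gene_name` pattern in the auto branch (a value followed by a space), not a copy of the linked lines.

```python
import re

# Hypothetical GTF attribute strings, made up for illustration.
ATTR_MIDDLE = 'gene_id "G1"; gene_name "ABC"; gene_biotype "protein_coding";'
ATTR_LAST = 'gene_id "G1"; gene_biotype "protein_coding"; gene_name "ABC";'

# Assumed shape of the old pattern: the value must be followed by a space.
OLD = ".*{name} (.*?) "
# Pattern from the patch above: the value may be followed by a space or ';'.
NEW = ".*{name} (.*?)(?:[ ;])"


def extract_attr(attribute, name_attr, pattern_template):
    """Return the value of name_attr, or None when the pattern does not match."""
    pattern = pattern_template.format(name=re.escape(name_attr))
    match = re.search(pattern, attribute)
    if match is None:
        return None
    # Drop the surrounding quotes and any trailing semicolon.
    return match.group(1).strip('";')


old_middle = extract_attr(ATTR_MIDDLE, "gene_name", OLD)
old_last = extract_attr(ATTR_LAST, "gene_name", OLD)
new_middle = extract_attr(ATTR_MIDDLE, "gene_name", NEW)
new_last = extract_attr(ATTR_LAST, "gene_name", NEW)

print(old_middle, old_last, new_middle, new_last)
```

With the space-terminated pattern, the attribute in last position has no trailing space to match, so the lookup returns nothing. In the pandas code that missing match becomes the NaN that is later passed on as a label. Note that both patterns stop at the first space, so a quoted value that itself contains a space would be truncated.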

Hope this helps.

Nanguage commented 9 months ago

I'm very sorry, I currently don't have the time and energy to maintain this project. Thank you very much for your suggestion. If you could submit a Pull Request, I would be happy to merge it.