SysBioChalmers / GECKO

Toolbox for including enzyme constraints on a genome-scale model.
http://sysbiochalmers.github.io/GECKO/
MIT License

bug: correctly parse PaxDB if taxonomic ID > 4 digits #345

Closed Soratake-HirotakaYajima closed 7 months ago

Soratake-HirotakaYajima commented 11 months ago

In calculateFfactor.m, line 57, the regular expression only handles four-digit taxonomic IDs.

I suggest changing it as follows:

genes = regexprep(genes,'(\d{4}).','');
==>
genes = regexprep(genes,'(\d+).','');

mihai-sysbio commented 11 months ago

I'm copying here the code block for reference:

% Gather Uniprot database for finding MW
uniprotDB = loadDatabases('uniprot', modelAdapter);
uniprotDB = uniprotDB.uniprot;

if ischar(protData) && endsWith(protData,'paxDB.tsv')
    fID         = fopen(fullfile(protData),'r');
    fileContent = textscan(fID,'%s','delimiter','\n');
    headerLines = sum(startsWith(fileContent{1},'#'));
    fclose(fID);

    %Read data file, excluding headerlines
    fID         = fopen(fullfile(protData),'r');
    fileContent = textscan(fID,'%s %s %f','delimiter','\t','HeaderLines',headerLines);
    genes       = fileContent{2};
    %Remove internal geneIDs modifiers
    genes       = regexprep(genes,'(\d{4}).','');
    level       = fileContent{3};
    fclose(fID);
    [a,b]       = ismember(genes,uniprotDB.genes);
    uniprot     = uniprotDB.ID(b(a));
    level(~a)   = [];
    clear protData
    protData.uniprot = uniprot;
    protData.level   = level;
end

If I understand it right, the role of that line is to strip the leading 4 digits and the period from the values in the 2nd column of the provided file (by default 'paxDB.tsv'):

https://github.com/SysBioChalmers/GECKO/blob/b512ea321b3001c1ff1c6140baccd3a3566e23d0/tutorials/full_ecModel/data/paxDB.tsv#L11-L13

In this file, the column has indeed some numbers and a period preceding the gene ids that we would need.
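To make the reported problem concrete, here is a minimal sketch of what the current pattern does to identifiers of the form "taxonomic ID, period, gene ID". The two sample identifiers are assumptions chosen for illustration (one with a 4-digit prefix, one with a 6-digit prefix); they are not rows copied from paxDB.tsv.

```matlab
% Illustrative identifiers: <taxonomic ID>.<gene ID>
% (sample values are assumptions, not taken from paxDB.tsv)
ids = {'4932.YAL001C'; '511145.b0001'};

% Pattern currently used in calculateFfactor.m:
% exactly four digits followed by ANY character (the period is unescaped)
stripped = regexprep(ids, '(\d{4}).', '');

disp(stripped)
% 'YAL001C'  -> the 4-digit prefix is removed as intended
% '5.b0001'  -> only '51114' is removed, leaving part of the prefix behind
```

With the longer prefix, the leftover '5.b0001' would no longer match any entry in uniprotDB.genes, so the ismember lookup would silently drop that protein.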

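The suggestion not to restrict the match to exactly 4 digits makes the regex more generic, which is ideally what we want. My suggestion would be to improve this further by anchoring the match to the start of the string (^), escaping the period (\.) so that it matches a literal period rather than any character, and dropping the parentheses, since the captured group is not used.

The end result would then be: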

genes = regexprep(genes,'^\d+\.','');

The line above needs testing, as I am not fully confident in the way Matlab interprets regular expressions.
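A quick check along the following lines could serve as that test. It is only a sketch: the sample identifiers and the expected gene IDs are assumptions made up for illustration, covering 4-digit and 6-digit prefixes and a gene ID that itself contains a long run of digits.

```matlab
% Illustrative identifiers (assumed values, not taken from paxDB.tsv)
ids      = {'4932.YAL001C'; '511145.b0001'; '9606.ENSP00000269305'};
expected = {'YAL001C';      'b0001';        'ENSP00000269305'};

% Anchored pattern: digits at the start of the string, then a literal period
anchored = regexprep(ids, '^\d+\.', '');
assert(isequal(anchored, expected), 'Anchored pattern did not strip the prefix cleanly');

% Unanchored pattern for comparison: regexprep replaces every match,
% so digits inside the gene ID are removed as well
unanchored = regexprep(ids, '(\d+).', '');
disp(unanchored)
% 'YAL'
% 'b'
% 'ENSP'
```

The comparison suggests that the anchor and the escaped period matter as much as relaxing the digit count: without them, any run of digits followed by another character inside the gene ID is also deleted.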