bio-parsers: errors in genbankToJson

manulera commented 2 months ago

Hello @tnrich, I noticed a couple of errors in genbankToJson:

Incorrectly parsing composed features with single positions

Features like join(1,3..4) were parsed incorrectly because the function read each pair of consecutive integers as the start and end of a location:

const locArr = [];
locStr.replace(/(\d+)/g, function (string, match) {
  locArr.push(match);
});

Because the integers were then consumed two at a time, this feature was interpreted as if it were join(1..3,4). A fix for this is proposed in the PR.
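
To illustrate the idea (this is only a minimal sketch, not the code in the PR), each comma-separated part can be matched as either a range a..b or a lone position a, instead of collecting all integers and pairing them up afterwards. The name parseLocationSketch is hypothetical, and the sketch assumes 0-based inclusive output and ignores less common syntax such as 1^2:

```javascript
// Hypothetical helper for illustration; not the bio-parsers implementation.
// Converts a GenBank location string into 0-based, inclusive {start, end} ranges.
function parseLocationSketch(locStr) {
  const locations = [];
  // Match "a..b" or a lone "a"; optional "<" / ">" partial markers are skipped.
  const re = /[<>]?(\d+)(?:\.\.[<>]?(\d+))?/g;
  let m;
  while ((m = re.exec(locStr)) !== null) {
    const start = parseInt(m[1], 10) - 1;
    // A single position is a one-base location, so end equals start.
    const end = m[2] === undefined ? start : parseInt(m[2], 10) - 1;
    locations.push({ start, end });
  }
  return locations;
}

console.log(parseLocationSketch("join(1,3..4)"));
```

With this approach join(1,3..4) yields two locations, one covering only the first base and one covering bases 3 to 4.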

Origin-spanning features not being parsed correctly when reading gb files

This is a followup to #47.

According to GenBank rules, origin-spanning features are described as a join, e.g. join(19..20,1) in a circular sequence of length 20, which is equivalent to {start: 18, end: 0} in TeselaGen's JSON.

The previous fix from #47 was not enough, because when parseFeatureLocation is called inside genbankToJson, the sequence has not been parsed yet, so we cannot use the length of the sequence to know where the origin is.

What I have done is:

Create a separate function wrapOriginSpanningFeatures that takes as an input the .locations array and merges joins like join(19..20,1).
When calling genbankToJson, this function is called in endSeq > postProcessCurSeq > postProcessGenbankFeature, which seemed to make sense.
When using parseFeatureLocation as a standalone, if you pass the sequenceLength, wrapOriginSpanningFeatures is also called.
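
As a rough sketch of the merging step (not the actual wrapOriginSpanningFeatures code, whose signature may differ), two consecutive locations can be merged when the first ends on the last base and the next starts on the first base. The function name below is hypothetical, and the sketch assumes 0-based inclusive locations listed in join order on the forward strand:

```javascript
// Hypothetical sketch; the real function in the PR may handle more cases
// (e.g. complement strands) and may have a different signature.
function wrapOriginSpanningSketch(locations, sequenceLength) {
  const merged = [];
  for (const loc of locations) {
    const prev = merged[merged.length - 1];
    if (prev && prev.end === sequenceLength - 1 && loc.start === 0) {
      // prev runs up to the last base and loc starts at the first base:
      // treat them as one location that wraps around the origin.
      prev.end = loc.end;
    } else {
      // Copy so the caller's array entries are not mutated.
      merged.push({ ...loc });
    }
  }
  return merged;
}

// join(19..20,1) on a circular sequence of length 20
console.log(wrapOriginSpanningSketch([{ start: 18, end: 19 }, { start: 0, end: 0 }], 20));
```

The sequence length is required to recognise the last base, which is why this step can only run once the sequence is known or when sequenceLength is passed explicitly.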

I have added tests for these cases as well.

Let me know if I should change something else.

tnrich commented 2 months ago

Looks good to me, thanks @manulera !

tnrich commented 2 months ago

Merged and published!

manulera commented 1 month ago

@tnrich where is this published? I don't see a new release for @teselagen/bio-parsers

https://www.npmjs.com/package/bio-parsers?activeTab=versions

tnrich commented 1 month ago

@manulera hmm you're right. I'll look into actually getting it published hah

tnrich commented 1 month ago

@manulera https://www.npmjs.com/package/@teselagen/bio-parsers seems like it is publishing fine. I think you're looking at the old deprecated version of bio-parsers

I'll try to update that one so it is clearer that it is no longer in use.

tnrich commented 1 month ago

@manulera ok, deprecated that package on npm so it will hopefully be clearer in the future!

TeselaGen / tg-oss

bio-parsers: errors in genbankToJson #71