Anush008 / fastembed-go

Go implementation of @qdrant/fastembed.
https://pkg.go.dev/github.com/anush008/fastembed-go
MIT License
49 stars 4 forks

tokenizer nil pointer dereference error with specific input text #13

Open alkuma opened 1 month ago

alkuma commented 1 month ago

I am getting a nil pointer error with specific texts. I created a test at https://github.com/alkuma/tokenizerissue to demonstrate the issue.

Two strings are being embedded: the first one goes through, but the second one fails.

Here is the output of the program:

/usr/local/go/bin/go tool test2json -t /home/alok/.cache/JetBrains/GoLand2024.2/tmp/GoLand/___1TestEmbedding_in_tokenizerissue.test -test.v=test2json -test.paniconexit0 -test.run ^\QTestEmbedding\E$
2024/09/17 15:19:17 INFO: CachedDir="/home/alok/.cache/tokenizer"
=== RUN   TestEmbedding
AS YOU LIKE IT
DRAMATIS PERSONAE
DUKE SENIOR     living in banishment.
DUKE FREDERICK  his brother, an usurper of his dominions.
AMIENS  | |  lords attending on the banished duke. JAQUES       |
LE BEAU a courtier attending upon Frederick.
CHARLES wrestler to Frederick.
OLIVER          | | JAQUES (JAQUES DE BOYS:)    |  sons of Sir Rowland de Boys. | ORLANDO               |
ADAM    | |  servants to Oliver. DENNIS |
TOUCHSTONE      a clown.
SIR OLIVER MARTEXT      a vicar.
CORIN   |
|  shepherds.
SILVIUS |
WILLIAM a country fellow in love with Audrey.
A person representing HYMEN. (HYMEN:)
ROSALIND        daughter to the banished duke.
CELIA   daughter to Frederick.
673
AS YOU LIKE IT
DRAMATIS PERSONAE
DUKE SENIOR     living in banishment.
DUKE FREDERICK  his brother, an usurper of his dominions.
AMIENS  | |  lords attending on the banished duke. JAQUES       |
LE BEAU a courtier attending upon Frederick.
CHARLES wrestler to Frederick.
OLIVER          | | JAQUES (JAQUES DE BOYS:)    |  sons of Sir Rowland de Boys. | ORLANDO               |
ADAM    | |  servants to Oliver. DENNIS |
TOUCHSTONE      a clown.
SIR OLIVER MARTEXT      a vicar.
CORIN   |
|  shepherds.
SILVIUS |
WILLIAM a country fellow in love with Audrey.
A person representing HYMEN. (HYMEN:)
ROSALIND        daughter to the banished duke.
CELIA   daughter to Frederick.

673
[[-0.03914698 -0.016538322 0.023178136 0.011642908 0.0564912 0.007894647 0.025433742 0.035716955 -0.019091763 -0.020351494 -0.01514089 0.012821367 -0.054411728 0.032000016 -0.018131282 0.08451082 0.049022898 0.017406626 0.017041788 -0.04623328 0.035323188 0.032879945 -0.00037763309 0.06970637 0.043458235 0.016123137 0.018936096 0.02063325 -0.05700099 -0.025971899 0.035846584 -0.011900318 -0.019333359 -0.02900332 -0.008514504 -0.04624489 -0.014379387 -0.020486495 0.011699142 0.006409424 -0.07082859 -0.013937539 0.005432031 -0.005396163 -0.06794449 -0.010348638 -0.036275588 0.017526548 -0.0058838953 -0.024287518 -0.048917953 0.039226435 -0.009778961 0.012310327 -0.023821943 0.022571076 -0.032029998 -0.04378382 0.028337928 -0.05277187 0.0017444869 0.006742158 0.0372529 0.0010056419 -0.00060814136 0.006685288 1.7267339e-05 0.06339974 -0.059310623 -0.021789877 -0.015668927 -0.0010967483 0.0025227757 -0.008613313 0.027283266 -0.063864276 -0.004863343 0.027290193 0.056868467 0.023454076 -0.016763298 -0.0054399003 0.006188793 0.048575606 -0.026876703 -0.03610761 -0.00043335767 0.0003789369 -0.081276685 0.039349385 -0.008940972 -0.049587313 0.03761981 0.042609006 -0.0079110265 -0.0035721026 0.051356286 -0.0044129933 0.0015475124 0.021560663 -0.031360585 -0.038248923 0.016443029 -0.00034044942 -0.09568959 -0.01081435 -0.009083789 0.03689808 0.0075237486 0.027368983 -0.005172214 -0.009384759 0.009782874 0.021921514 -0.045917857 0.06161481 0.043410208 -0.013445491 -0.0077494937 0.023404986 0.06215935 0.020469893 -0.0072953156 0.10336666 0.024090728 0.020731067 -0.011793335 0.05349762 -0.013003309 -0.06273616 0.005801809 0.05778524 -0.01770478 -0.010948908 -0.001877506 -0.042103637 0.014173885 -0.0043255757 0.035545927 0.010555955 0.0022608016 -0.01118194 0.013429064 0.020862108 0.043351796 -0.030716933 -0.017455425 -0.041980993 0.012616989 0.048666902 -0.012770984 0.031477388 0.030343024 -0.036949676 0.012917157 0.058238946 0.029963208 0.026458332 -0.021398345 0.02069551 
0.03828373 0.027362816 0.01144157 0.01773172 0.022156883 -0.010513427 0.0060478807 0.01718434 -0.04091684 0.021870496 -0.072977014 -0.016520886 0.06187403 -0.041432407 -0.018332282 0.039845582 0.09559854 0.041445937 0.04684613 -0.0069881086 -0.07639642 0.040348083 0.02335101 -0.0046012397 0.026520465 -0.007890131 0.052469924 0.010194702 0.014858035 0.03759209 -0.06054353 -0.064428255 -0.02382746 0.0030103163 0.045369398 -0.019959413 -0.004659652 0.03708061 -0.0038971528 0.06279973 -0.012340761 0.020642685 0.04060641 -0.006453256 -0.061737575 0.018255593 -0.001301643 0.024874456 0.032203175 -0.011828143 -0.03851288 -0.012062547 0.0664155 -0.049623474 0.02326029 -0.015502906 0.052531365 -0.037508376 0.017440705 -0.0735822 0.025554365 -0.012990718 -0.041517846 0.023062894 -0.024853319 0.100043885 0.056865305 -0.05963884 -0.04027259 0.024926204 -0.01888787 -0.025096748 -0.0013074251 -0.01325122 -0.010748644 0.011728527 0.004855801 -0.046975892 0.03411985 -0.056537498 0.0056181317 0.053715814 0.011858979 0.079618104 0.017376544 0.01665108 0.034709867 0.0006663871 -0.056170613 -0.02711519 -0.0014543701 -0.03524299 0.0075247423 -0.022341667 0.008779559 -0.05686332 -0.032249086 0.049802165 0.03996286 0.05161114 -0.042233754 0.014376176 -0.016475571 -0.018463984 -0.013941526 -0.036131732 -0.037772164 -0.012133741 0.033861097 -0.005092063 0.02904495 -0.002741557 -0.0012500169 0.004346772 0.005123095 8.7455184e-05 0.072036505 0.00032354923 0.024320055 -0.039208207 -0.01390895 0.074305646 -0.047137924 -0.03887461 0.001901088 -0.10452757 0.03541344 -0.051450636 -0.039202023 0.0037135687 0.038421314 0.037239667 0.030913997 0.00741533 0.03195537 0.00699422 0.0046604634 0.035519995 -0.015194695 -0.0059102173 -0.0125123635 -0.0060820356 0.013914759 -0.0015158656 0.02122563 -0.02741586 0.0085247895 -0.031034654 -0.26160786 0.0021040207 0.034374084 -0.040845644 0.049236394 -0.019883346 0.040674936 -0.03596126 -0.05063188 0.035107706 -0.029123846 -0.0457412 0.010796176 0.042915713 
0.039227314 0.015756665 -0.018854285 -0.045812745 -0.009172312 0.037980657 -0.021215655 -0.054714758 -0.030052204 0.024671923 0.025940834 0.059799552 -0.050287012 -0.0030677565 -0.086370826 -0.03388499 0.0021782645 0.0038881723 0.033259008 -0.015950117 -0.0035676898 -0.041105423 0.04366649 -0.0068972823 -0.021965034 0.0011830716 -0.02629944 -0.044561606 -0.023651939 0.009472122 0.05867902 -0.016693924 -0.06414829 -0.0066306265 -0.03840866 0.06758065 -0.02788127 -0.011591923 -0.005063631 0.002926456 -0.0056525413 0.0028303913 -0.0055221547 0.01315445 -0.06290255 -0.04398332 -0.012094841 -0.04480738 -0.041383084 -0.023820942 -0.008420631 -0.057843395 -0.04899028 -0.01342248 0.09446904 0.038170658 -0.040533535 -0.007015521 0.01192337 -0.08365915 0.0017968907 -0.0025380466 -0.009427974 -0.009932006 0.00026163805 0.02594476 -0.030752674 -0.026657093 0.021098124 0.008863274 -0.006488929 -0.03697985 0.023044562 -0.02419331 -0.036591124 -0.024120301 0.06960363 -0.010372081 -0.025158368 -0.013693026 0.01300504 0.02227767 -0.0015247545 -0.015730513 0.0238872 0.01825556 0.0370508 -0.074274346 0.033484608 -0.0060399654 0.0067823497 -0.0060035777 -0.05207626 0.029591309 0.03991352 0.017776724 0.056803543 -0.0036727912 0.034457386 -0.046009373 0.00023433224 -0.071260884 0.02851205 0.07166555 0.0063079665 0.038949873 -0.05573132 0.041894786 -0.036953613 -0.020935554 -0.0922639 -0.012961587 0.009381917 -0.011597907 -0.019261444 -0.00639877 -0.004511787 -0.0033551345 0.027393656 -0.024261534 0.017353045 0.00080475234 -0.05555794 -0.052705985 -0.0014381633 0.0018840844 0.021800129 0.010827761 -0.0063026166 0.03353285 0.044599503 0.0077848737 -0.0029283045 -0.00039049625 0.018483976 0.035987716 0.005219433 0.003641071 0.030400632 -0.059704714 -0.021531524 -0.032892182 0.013581656 -0.006007797 0.008786557 -0.02286594 -0.02111237 -0.04407928 -0.025530605 -0.0068782447 0.0074550346 0.062660806 -0.010601268 -0.010685531 -0.015256402 0.019312108 0.025710458 0.014006963 -0.045301154 
0.01740028 -0.009736621 0.0066993353 -0.022136973 0.013612366 0.05849686 0.029680526 0.001417695 -0.03254062 -0.0018819447 0.0041718297 0.06276969 0.035705727 0.005127659 -0.06511382 -0.0036923448 -0.0047796667 -0.0006886609 0.028202135 -0.03349943 0.013126994 -0.057374008 -0.07305299 0.02789206 -0.0026524563 0.024118802 0.010876676 0.016884591 -0.006562245 -0.04699496 0.028407542 0.043413766 -0.072359815 0.061121542 0.0023021614 -0.009506745 0.017742652 -0.011882974 -0.051569894 -0.0032277745 0.013072393 0.0252644 -0.06367772 -0.012006346 -0.039752934 0.016992357 -0.01946568 0.017556485 -0.039766937 -0.015146741 0.0043553817 -0.03300536 0.041409392 -0.029696869 -0.034427825 0.03265753 -0.033445444 0.029599441 -0.015332254 0.0038055116 0.04395136 -0.019857742 -0.0037471876 -0.019987168 -0.027075827 0.0051693665 0.057406757 0.033968635 0.018858982 -0.032702416 -0.02568262 -0.015521807 0.02559059 0.011727608 -0.017817227 0.0022101407 0.04306708 0.0001521992 -0.002650939 -0.021742256 -0.012054737 0.068472214 -0.047306042 -0.014674873 0.017066197 -0.051577978 0.030212536 0.002544334 0.02917181 -0.019093212 0.02930066 0.05152553 0.009152614 0.029787736 0.0011963875 0.052472897 -0.0361598 0.00010058674 -0.06904818 0.016232267 -0.0039677448 0.011245551 0.013937295 -0.015575298 -0.046503574 0.06782438 -0.08391851 -0.026548455 0.04568559 -0.030084113 0.010012481 0.020641306 -0.069049835 0.0027308327 0.021092122 -0.03908603 0.0064549767 0.014999664 0.052215375 0.0031571654 0.02453982 0.015449896 -0.009599123 0.054865893 0.038270622 0.008379506 0.05169393 -0.0635431 0.05361065 0.027451267 -0.02504078 -0.0318296 0.021326253 -0.008771796 -0.07166529 0.0046098814 0.008210814 -0.012494197 -0.07983677 0.0322951 0.016638167 -0.027372014 -0.04498509 -0.0115331495 -0.026469693 -0.03370635 0.000676141 0.011307931 -0.011655599 0.06414379 0.018598035 0.025064886 0.063107245 -0.017471809 0.037015863 -0.0041355346 0.09167845 0.06278827 0.049575448 -0.032504965 0.094415836 -0.0070365896 
-0.06828078 0.03029201 0.03385621 -0.023417555 -0.019534213 0.008425382 0.058012586 0.0021701755 0.050336093 -0.013609865 -0.011643509 -0.0058129276 -0.0142343035 0.04619372 0.015765378 0.028137436 0.038674865 0.018905077 -0.06938297 0.039243255 0.020575562 -0.027785309 0.0044124466 -0.041977398 0.033078786 0.0023755538 0.0013827555 0.080165684 0.021713875 -0.008895852 0.010854239 0.030240793 0.010076886 -0.0068736626 -0.010659401 0.0091342125 -0.016192537 -0.03269065 0.0015859033 0.014045188 -0.005773467 0.025777139 -0.03233787 0.0020606334 0.022983052 0.036939822 -0.043826174 -0.04531051 -0.052388918 -0.048537176 -0.05221436 -0.023132278 -0.008065607 -0.041005827 -0.048821874 -0.018616289 -0.036834672 -0.0131818615 0.00032311416 -0.0608724 -0.0473172 0.017388172 0.03620469 0.016872536 0.009612658 0.06283182 0.0266591 -0.0407606 -0.018680993 0.009808718 0.045869667 0.0017224478 0.020221831 -0.106909215 0.032913286 0.045634817 -0.011272518 -0.07594389 0.03301969 -0.014931814 -0.03439635 0.051964276 0.014607602 -0.0019748472 -0.031476032 -0.014223328 0.0025003545 0.010445406 0.049866706 -0.060485397 0.08876377 0.033138666 0.01942703 -0.052508734 0.015518047 0.0050181053 0.023438185 -0.06435748 -0.007261127 -0.009940068 -0.08559045 -0.02445086 0.01683098 -0.041163374 -0.044273637 0.017937073 -0.023909848 0.0026623239 0.019933624 -0.022201682 -0.029950371 -0.032257035 -0.0068081166 -0.043268044 0.032621004 0.02144448 -0.0013739939 0.019817922 -0.052019957 -0.0036603028 -0.009124586 -0.009007775 0.01633006 0.0038869274 0.010353903]]
AS YOU LIKE IT
DRAMATIS PERSONAE
DUKE SENIOR     living in banishment.
DUKE FREDERICK  his brother, an usurper of his dominions.
AMIENS  | |  lords attending on the banished duke. JAQUES       |
LE BEAU a courtier attending upon Frederick.
CHARLES wrestler to Frederick.
OLIVER          | | JAQUES (JAQUES DE BOYS:)    |  sons of Sir Rowland de Boys. | ORLANDO               |
ADAM    | |  servants to Oliver. DENNIS |
TOUCHSTONE      a clown.
SIR OLIVER MARTEXT      a vicar.
CORIN   |
|  shepherds.
SILVIUS |
WILLIAM a country fellow in love with Audrey.
A person representing HYMEN. (HYMEN:)
ROSALIND        daughter to the banished duke.
CELIA   daughter to Frederick.
PHEBE   a shepherdess.
AUDREY  a country wench.
Lords, pages, and attendants, &c. (Forester:) (A Lord:) (First Lord:) (Second Lord:) (First Page:) (Second Page:)
835
AS YOU LIKE IT
DRAMATIS PERSONAE
DUKE SENIOR     living in banishment.
DUKE FREDERICK  his brother, an usurper of his dominions.
AMIENS  | |  lords attending on the banished duke. JAQUES       |
LE BEAU a courtier attending upon Frederick.
CHARLES wrestler to Frederick.
OLIVER          | | JAQUES (JAQUES DE BOYS:)    |  sons of Sir Rowland de Boys. | ORLANDO               |
ADAM    | |  servants to Oliver. DENNIS |
TOUCHSTONE      a clown.
SIR OLIVER MARTEXT      a vicar.
CORIN   |
|  shepherds.
SILVIUS |
WILLIAM a country fellow in love with Audrey.
A person representing HYMEN. (HYMEN:)
ROSALIND        daughter to the banished duke.
CELIA   daughter to Frederick.
PHEBE   a shepherdess.
AUDREY  a country wench.
Lords, pages, and attendants, &c. (Forester:) (A Lord:) (First Lord:) (Second Lord:) (First Page:) (Second Page:)

835
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0x730074]

goroutine 35 [running]:
github.com/sugarme/tokenizer.(*Encoding).GetIds(...)
    /path/to/go/pkg/mod/github.com/sugarme/tokenizer@v0.2.3-0.20230829214935-448e79b1ed65/encoding.go:215
github.com/sugarme/tokenizer.TruncateEncodings(0xc0009d2d00, 0x0, 0xc0009d2c30?)
    /path/to/go/pkg/mod/github.com/sugarme/tokenizer@v0.2.3-0.20230829214935-448e79b1ed65/util.go:108 +0x54
github.com/sugarme/tokenizer.(*Tokenizer).PostProcess(0xc000465680, 0xc0009d2d00?, 0x0?, 0x1)
    /path/to/go/pkg/mod/github.com/sugarme/tokenizer@v0.2.3-0.20230829214935-448e79b1ed65/tokenizer.go:602 +0xe5
github.com/sugarme/tokenizer.(*Tokenizer).Encode(0xc000465680, {0x7a1e20, 0xc0003121e0}, 0x1)
    /path/to/go/pkg/mod/github.com/sugarme/tokenizer@v0.2.3-0.20230829214935-448e79b1ed65/tokenizer.go:464 +0x2e5
github.com/sugarme/tokenizer.(*Tokenizer).EncodeBatch.func1(0x0)
    /path/to/go/pkg/mod/github.com/sugarme/tokenizer@v0.2.3-0.20230829214935-448e79b1ed65/tokenizer.go:647 +0x90
created by github.com/sugarme/tokenizer.(*Tokenizer).EncodeBatch in goroutine 34
    /path/to/go/pkg/mod/github.com/sugarme/tokenizer@v0.2.3-0.20230829214935-448e79b1ed65/tokenizer.go:644 +0xf5

Process finished with the exit code 1

The first chunk has 634 characters and is embedded successfully. The next chunk has 835 characters (i.e., the first 634 characters plus an additional 201 characters) and fails with the tokenizer nil pointer dereference error.

Has anybody faced this before? Is it a known issue, and if so, is there a way to work around it?

Please let me know if any additional information is required.

To execute the test, follow these steps:

  1. Clone the https://github.com/alkuma/tokenizerissue repository.
  2. Set ONNX_PATH to the correct value for your system.
  3. Run the test TestEmbedding in embed_test.go; it should reproduce the error.

alkuma commented 1 month ago

Since a similar issue was reported (and closed via a code change / PR) on the tokenizer side, I forked both tokenizer and fastembed-go, published the latest master / main branch of each, and used the forks as dependencies. The error is gone.

Perhaps all that's needed is to publish the latest versions of both?
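For anyone wanting to try the same thing without publishing a fork, a `replace` directive in the consuming project's go.mod is one possible route. This is a minimal sketch, not the exact setup used above: the `../tokenizer` and `../fastembed-go` paths are hypothetical local checkouts of the latest master / main branches. The module paths on the left are the ones shown in the stack trace and on pkg.go.dev.

```
// Fragment of the consuming project's go.mod.
// The right-hand paths are hypothetical local checkouts; adjust as needed.
replace github.com/sugarme/tokenizer => ../tokenizer

replace github.com/anush008/fastembed-go => ../fastembed-go
```

A `replace` only works if the replacement's own go.mod still declares the original module path, so a fork whose `module` line was renamed would have to be imported under its new path instead.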

Anush008 commented 1 month ago

@alkuma, I'd recommend keeping your project dependent on your forks. That gives you the flexibility to make any changes you need. As far as I can see, neither fastembed-go nor https://github.com/sugarme/tokenizer is under active maintenance.