arthurdejong / python-stdnum

A Python library to provide functions to handle, parse and validate standard numbers.
https://arthurdejong.org/python-stdnum/
GNU Lesser General Public License v2.1
495 stars 205 forks source link

Incorrect IMSI parsing #257

Closed david-l-anderson-dlande closed 3 years ago

david-l-anderson-dlande commented 3 years ago

When parsing some IMSIs (at least, from India), the split returns the wrong MNC, and gives a result which when concatenated back together is a digit too short.

Python 3.8.5 (default, Jan 27 2021, 15:41:15) 
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import stdnum.imsi
>>> stdnum.__version__
'1.16'
>>> stdnum.imsi.split('405034001111111')
('405', '03', '001111111')

Notice that the second '4' simply vanished.

hozn commented 3 years ago

In this particular case, it would appear that this is due to NumDB._merge prioritizing the shortest part length rather than the longest. Making the following change fixes this particular case, but I expect has undesired consequences elsewhere.

--- a/stdnum/numdb.py
+++ b/stdnum/numdb.py
@@ -126,7 +126,7 @@ class NumDB():
         for parts in zip(*results):
             # regroup parts into parts list and properties list
             partlist, proplist = list(zip(*(x for x in parts if x)))
-            part = min(partlist, key=len)
+            part = max(partlist, key=len)
             props = {}
             for p in proplist:
                 props.update(p)

(Indeed: the ISBN splitting tests fail with this change.)

arthurdejong commented 3 years ago

Thanks for your bug report. The underlying cause is that on Wikipedia for the MCC "405" both MNC "03" and "034" are listed and that the numdb module assumes that ranges are consistently assigned.

We probably need some kind of sanity checks in the generated files that are shipped as part of python-stdnum to detect these kind of issues before release. The numdb module also needs to behave consistently so that an inconsistent structure does not result in digits getting lost on splitting a number.

The question remains how to deal with this particular situation. The Wikipedia page for the MNC lists "034" as "Not operational" and "Merged with Airtel in 2019". We could leave all "Not operational" entries from the data file but that would also remove some historical data and make numbers that were valid before suddenly become invalid. In this case it is obvious that split() should return ('405', '03', '4001111111') which is "more correct" than ('405', '034', '001111111') but this is a tricky one to solve nicely.

arthurdejong commented 3 years ago

The fix for the last bit is implemented in b7901d6 by ignoring "Not operational" entries if they would conflict with other entries.

In 38c368d a consistency improvement in the numdb handling was made to ensure that if there is conflicting data at least one splitting is chosen and no digits should get lost.

The sanity check for the data files is still on my TODO list ;)

hozn commented 3 years ago

I believe in this case the IR.21 file from the operator actually lists 405 034 as a valid MCC, MNC pair. I'm not entirely sure how to deconflict these datasets, but it would appear that 405034 would be correct here.

image

image

I think what may be going on here, looking at some other datasets, has to do with the leading 0 that could be present in the MNC (and hence IMSI) from this operator. Other datasets show this PLMN as 40534. What a confusing world.

This document also suggests 405034 is correct: https://www.itu.int/dms_pub/itu-t/opb/sp/T-SP-E.212B-2018-PDF-E.pdf