MIT-LCP / wfdb-python

Native Python WFDB package
MIT License

Optimization of wfdb.io.annotation.field2bytes function #406

Closed Fegalf closed 2 years ago

Fegalf commented 2 years ago

Hi,

I noticed that writing an annotation file is slow when the file contains many annotations. Line-profiling the writing functions showed that the field2bytes function takes up most of the execution time.

So, it turns out that the problem was with this line: typecode = ann_label_table.loc[ann_label_table["symbol"] == value[1], "label_store"].values[0]

This line filters the entire ann_label_table DataFrame on every call to field2bytes with a "samptype" field, which is slow. Instead, I added a dictionary that maps each symbol to its corresponding label_store value, which is much faster: total time spent in field2bytes drops from about 86.4 s to about 2.4 s (see the time profiler output below).
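A minimal sketch of the idea, written without pandas so it runs on its own. The rows in `ANN_ROWS` are a small illustrative stand-in for `ann_label_table`, and `typecode_by_scan` is a hypothetical helper that only mimics the cost pattern of the per-call filter; the real pandas expression also pays for building a boolean mask and for `.loc` indexing. With the actual DataFrame, the mapping could presumably be built once with something like `dict(zip(ann_label_table["symbol"], ann_label_table["label_store"]))`.

```python
# Illustrative stand-in for wfdb's ann_label_table: (label_store, symbol) rows.
# Only a handful of rows are shown; the real table has many more.
ANN_ROWS = [
    (1, "N"),
    (2, "L"),
    (3, "R"),
    (5, "V"),
    (8, "A"),
]


def typecode_by_scan(symbol):
    """Hypothetical helper: find a symbol by scanning every row on each call."""
    for label_store, row_symbol in ANN_ROWS:
        if row_symbol == symbol:
            return label_store
    raise KeyError(symbol)


# Build the mapping once, outside the per-annotation loop.
# "typecodes" is the name used in the profiled code in this issue.
typecodes = {symbol: label_store for label_store, symbol in ANN_ROWS}

symbols = ["N", "V", "N", "A"]

# Both approaches return the same codes; only the per-lookup cost differs.
scanned = [typecode_by_scan(s) for s in symbols]
mapped = [typecodes[s] for s in symbols]
print(mapped)  # [1, 5, 1, 8]
```

A dictionary lookup takes roughly constant time per symbol, so the cost of converting many annotations no longer depends on the size of the label table.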

Time profilers

Current version

Total time: 86.361 s
File: wfdb/io/annotation.py
Function: field2bytes at line 1602

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
  1602                                           @profile
  1603                                           def field2bytes(field, value):
  1604                                               """
  1605                                               Convert an annotation field into bytes to write.
  1606                                           
  1607                                               Parameters
  1608                                               ----------
  1609                                               field : str
  1610                                                   The annotation field of the value to be converted to bytes.
  1611                                               value : list
  1612                                                   The value to be converted to bytes.
  1613                                           
  1614                                               Returns
  1615                                               -------
  1616                                               data_bytes : list, ndarray
  1617                                                   All of the bytes to be written to the annotation file.
  1618                                           
  1619                                               """
  1620    361156     273292.0      0.8      0.3      data_bytes = []
  1621                                           
  1622                                               # samp and sym bytes come together
  1623    361156     248245.0      0.7      0.3      if field == "samptype":
  1624                                                   # Numerical value encoding annotation symbol
  1625    179612   83467815.0    464.7     96.6          typecode = ann_label_table.loc[ann_label_table["symbol"] == value[1], "label_store"].values[0]
  1626                                                   #typecode = typecodes[value[1]]
  1627                                                   # sample difference
  1628    179612     236106.0      1.3      0.3          sd = value[0]
  1629                                           
  1630    179612     131775.0      0.7      0.2          data_bytes = []
  1631                                           
  1632                                                   # Add SKIP element(s) if the sample difference is too large to
  1633                                                   # be stored in the annotation type word.
  1634                                                   #
  1635                                                   # Each SKIP element consists of three words (6 bytes):
  1636                                                   #  - Bytes 0-1 contain the SKIP indicator (59 << 10)
  1637                                                   #  - Bytes 2-3 contain the high 16 bits of the sample difference
  1638                                                   #  - Bytes 4-5 contain the low 16 bits of the sample difference
  1639                                                   # If the total difference exceeds 2**31 - 1, multiple skips must
  1640                                                   # be used.
  1641    181444     255089.0      1.4      0.3          while sd > 1023:
  1642      1832       3423.0      1.9      0.0              n = min(sd, 0x7FFFFFFF)
  1643      1832        915.0      0.5      0.0              data_bytes += [
  1644      1832        931.0      0.5      0.0                  0,
  1645      1832        916.0      0.5      0.0                  59 << 2,
  1646      1832       2251.0      1.2      0.0                  (n >> 16) & 255,
  1647      1832       1563.0      0.9      0.0                  (n >> 24) & 255,
  1648      1832       1583.0      0.9      0.0                  (n >> 0) & 255,
  1649      1832       2294.0      1.3      0.0                  (n >> 8) & 255,
  1650                                                       ]
  1651      1832       1957.0      1.1      0.0              sd -= n
  1652                                           
  1653                                                   # Annotation type itself is stored as a single word:
  1654                                                   #  - bits 0 to 9 store the sample difference (0 to 1023)
  1655                                                   #  - bits 10 to 15 store the type code
  1656    179612     442489.0      2.5      0.5          data_bytes += [sd & 255, ((sd & 768) >> 8) + 4 * typecode]
  1657                                           
  1658    181544     100423.0      0.6      0.1      elif field == "num":
  1659                                                   # First byte stores num
  1660                                                   # second byte stores 60*4 indicator
  1661                                                   data_bytes = [value, 240]
  1662    181544      95246.0      0.5      0.1      elif field == "subtype":
  1663                                                   # First byte stores subtype
  1664                                                   # second byte stores 61*4 indicator
  1665      1932       1299.0      0.7      0.0          data_bytes = [value, 244]
  1666    179612      95012.0      0.5      0.1      elif field == "chan":
  1667                                                   # First byte stores chan
  1668                                                   # second byte stores 62*4 indicator
  1669                                                   data_bytes = [value, 248]
  1670    179612     107277.0      0.6      0.1      elif field == "aux_note":
  1671                                                   # - First byte stores length of aux_note field
  1672                                                   # - Second byte stores 63*4 indicator
  1673                                                   # - Then store the aux_note string characters
  1674    179612     531112.0      3.0      0.6          data_bytes = [len(value), 252] + [ord(i) for i in value]
  1675                                                   # Zero pad odd length aux_note strings
  1676    179612     150545.0      0.8      0.2          if len(value) % 2:
  1677                                                       data_bytes.append(0)
  1678                                           
  1679    361156     209407.0      0.6      0.2      return data_bytes

New version

Total time: 2.40503 s
File: /home/nicolasbg/miniconda3/envs/physionet/lib/python3.7/site-packages/wfdb/io/annotation.py
Function: field2bytes at line 1602

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
  1602                                           @profile
  1603                                           def field2bytes(field, value):
  1604                                               """
  1605                                               Convert an annotation field into bytes to write.
  1606                                           
  1607                                               Parameters
  1608                                               ----------
  1609                                               field : str
  1610                                                   The annotation field of the value to be converted to bytes.
  1611                                               value : list
  1612                                                   The value to be converted to bytes.
  1613                                           
  1614                                               Returns
  1615                                               -------
  1616                                               data_bytes : list, ndarray
  1617                                                   All of the bytes to be written to the annotation file.
  1618                                           
  1619                                               """
  1620    361156     199665.0      0.6      8.3      data_bytes = []
  1621                                           
  1622                                               # samp and sym bytes come together
  1623    361156     213260.0      0.6      8.9      if field == "samptype":
  1624                                                   # Numerical value encoding annotation symbol
  1625    179612     121482.0      0.7      5.1          typecode = typecodes[value[1]]
  1626                                                   # sample difference
  1627    179612     100643.0      0.6      4.2          sd = value[0]
  1628                                           
  1629    179612      97847.0      0.5      4.1          data_bytes = []
  1630                                           
  1631                                                   # Add SKIP element(s) if the sample difference is too large to
  1632                                                   # be stored in the annotation type word.
  1633                                                   #
  1634                                                   # Each SKIP element consists of three words (6 bytes):
  1635                                                   #  - Bytes 0-1 contain the SKIP indicator (59 << 10)
  1636                                                   #  - Bytes 2-3 contain the high 16 bits of the sample difference
  1637                                                   #  - Bytes 4-5 contain the low 16 bits of the sample difference
  1638                                                   # If the total difference exceeds 2**31 - 1, multiple skips must
  1639                                                   # be used.
  1640    181444     147706.0      0.8      6.1          while sd > 1023:
  1641      1832       2554.0      1.4      0.1              n = min(sd, 0x7FFFFFFF)
  1642      1832        986.0      0.5      0.0              data_bytes += [
  1643      1832        991.0      0.5      0.0                  0,
  1644      1832        968.0      0.5      0.0                  59 << 2,
  1645      1832       1856.0      1.0      0.1                  (n >> 16) & 255,
  1646      1832       1570.0      0.9      0.1                  (n >> 24) & 255,
  1647      1832       1548.0      0.8      0.1                  (n >> 0) & 255,
  1648      1832       2074.0      1.1      0.1                  (n >> 8) & 255,
  1649                                                       ]
  1650      1832       1556.0      0.8      0.1              sd -= n
  1651                                           
  1652                                                   # Annotation type itself is stored as a single word:
  1653                                                   #  - bits 0 to 9 store the sample difference (0 to 1023)
  1654                                                   #  - bits 10 to 15 store the type code
  1655    179612     253318.0      1.4     10.5          data_bytes += [sd & 255, ((sd & 768) >> 8) + 4 * typecode]
  1656                                           
  1657    181544     100786.0      0.6      4.2      elif field == "num":
  1658                                                   # First byte stores num
  1659                                                   # second byte stores 60*4 indicator
  1660                                                   data_bytes = [value, 240]
  1661    181544      99500.0      0.5      4.1      elif field == "subtype":
  1662                                                   # First byte stores subtype
  1663                                                   # second byte stores 61*4 indicator
  1664      1932       1163.0      0.6      0.0          data_bytes = [value, 244]
  1665    179612      98431.0      0.5      4.1      elif field == "chan":
  1666                                                   # First byte stores chan
  1667                                                   # second byte stores 62*4 indicator
  1668                                                   data_bytes = [value, 248]
  1669    179612     102374.0      0.6      4.3      elif field == "aux_note":
  1670                                                   # - First byte stores length of aux_note field
  1671                                                   # - Second byte stores 63*4 indicator
  1672                                                   # - Then store the aux_note string characters
  1673    179612     541168.0      3.0     22.5          data_bytes = [len(value), 252] + [ord(i) for i in value]
  1674                                                   # Zero pad odd length aux_note strings
  1675    179612     120971.0      0.7      5.0          if len(value) % 2:
  1676                                                       data_bytes.append(0)
  1677                                           
  1678    361156     192616.0      0.5      8.0      return data_bytes
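
To make the byte layout in the "samptype" branch above easier to follow, here is a small standalone sketch that repeats the same arithmetic outside of wfdb. The function name `encode_samptype` is hypothetical and not part of the library; it takes the type code directly, so the symbol lookup is left out.

```python
def encode_samptype(sd, typecode):
    """Pack a sample difference and a type code into annotation bytes.

    Mirrors the arithmetic of the "samptype" branch of field2bytes shown
    in the profiler listing; this is a sketch, not the library function.
    """
    data_bytes = []

    # Emit SKIP elements while the difference does not fit in 10 bits.
    while sd > 1023:
        n = min(sd, 0x7FFFFFFF)
        data_bytes += [
            0,                 # low byte of the SKIP word
            59 << 2,           # high byte of the SKIP word (59 << 10 as a word)
            (n >> 16) & 255,   # high 16 bits of n, low byte first
            (n >> 24) & 255,
            (n >> 0) & 255,    # low 16 bits of n, low byte first
            (n >> 8) & 255,
        ]
        sd -= n

    # Annotation word: bits 0-9 hold the difference, bits 10-15 the type code.
    data_bytes += [sd & 255, ((sd & 768) >> 8) + 4 * typecode]
    return data_bytes


print(encode_samptype(100, 1))   # [100, 4]
print(encode_samptype(2000, 5))  # [0, 236, 0, 0, 208, 7, 0, 20]
```

In the second call, the difference of 2000 exceeds 1023, so it is carried entirely by one SKIP element and the annotation word that follows stores a difference of 0.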