dglazkov / polymath

MIT License

Smarten up substack import #28

Closed by dglazkov 1 year ago

dglazkov commented 1 year ago

Currently, the Substack importer simply chunks the text content of every node in the HTML. This produces many empty and very short strings. For example, a title gets separated from its article, so the two end up as distinct chunks.

Ideally, that would not happen. The importer needs some awareness of element order and tags, so that each heading stays together with the body that follows it. Maybe even with each paragraph?
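A minimal sketch of what heading-aware chunking could look like. It is not the code from the commits referenced in this thread. It assumes only the standard-library `html.parser`, and the names `_BlockCollector`, `chunk_by_heading`, and `per_paragraph` are hypothetical. Text outside the listed block tags is ignored, which a real importer would likely need to handle.

```python
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
BLOCK_TAGS = HEADING_TAGS | {"p", "li"}


class _BlockCollector(HTMLParser):
    """Collects (tag, text) pairs for block-level elements, in document order."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._tag = None   # block tag currently being read, if any
        self._parts = []   # text fragments seen inside that block

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._flush()
            self._tag = tag

    def handle_endtag(self, tag):
        if tag == self._tag:
            self._flush()

    def handle_data(self, data):
        if self._tag is not None:
            self._parts.append(data)

    def _flush(self):
        # Collapse whitespace and drop blocks that turn out to be empty.
        text = " ".join("".join(self._parts).split())
        if self._tag is not None and text:
            self.blocks.append((self._tag, text))
        self._tag, self._parts = None, []

    def close(self):
        super().close()
        self._flush()


def chunk_by_heading(html, per_paragraph=False):
    """Return text chunks in which each heading is kept with its body."""
    collector = _BlockCollector()
    collector.feed(html)
    collector.close()

    # Group body blocks under the most recent heading (None before the first one).
    sections = []
    for tag, text in collector.blocks:
        if tag in HEADING_TAGS:
            sections.append((text, []))
        else:
            if not sections:
                sections.append((None, []))
            sections[-1][1].append(text)

    chunks = []
    for heading, bodies in sections:
        if not bodies:
            continue  # assumption: a heading with no body is dropped, not emitted alone
        prefix = [heading] if heading else []
        if per_paragraph:
            chunks.extend("\n".join(prefix + [body]) for body in bodies)
        else:
            chunks.append("\n".join(prefix + bodies))
    return chunks
```

With `per_paragraph=False` each section becomes one chunk. With `per_paragraph=True` every paragraph becomes its own chunk, prefixed with its heading, which is one reading of the "maybe even each paragraph" idea.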

dglazkov commented 1 year ago

e28d202c8d9ad950f36ba9b737d103f1a46b3c1f makes progress toward this

dglazkov commented 1 year ago

Fixed in ec8e5739fe547b98fbdf529b7e936273dfce6f68, c21d0eea3a1d9da41ddb4f07b81ed0fc8e308e72, be436c67a75610350ea46c622ab6992b59b76a01 and fcdd403de5ca17254dded8bece7b23eb33cb9818