aws-samples / amazon-textract-response-parser

Parse JSON response of Amazon Textract
Apache License 2.0
.html() function to support skipping certain blocks #179

Closed: rsanaie closed 2 months ago

rsanaie commented 3 months ago

When using the .html() function, some blocks (such as page numbers, headers and footers) aren't needed in the output and should be skipped. Is there a way to specify which blocks to skip?

Thanks

athewsey commented 3 months ago

Sure, this makes sense to me. At the moment these methods would be annoyingly complex to re-implement in user code, but they don't really provide any configurability for filtering content. We also still have some TODOs to update the old getLinesByLayoutArea, getFooterLines and getHeaderLines heuristic methods to play nicely with Layout where available.

I'm thinking of something similar to the (recently introduced) IBlockTypeFilterOpts, like this:

page.html({
  skipBlockTypes: [ApiBlockType.LayoutHeader, ApiBlockType.LayoutFooter],
});

Would block type be sufficient for your use-case? Or do you think you'd need to be able to pull out individual instances with e.g. skipBlockIds as well?
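To make the proposal concrete, here is a minimal, self-contained sketch of filtering by block type during HTML rendering. It is a toy model, not the library's implementation: `ToyBlockType`, `ToyLayoutItem`, `ToyHtmlOpts`, `renderPageHtml` and the tag mapping are all hypothetical names invented for illustration. Only the `skipBlockTypes` option name and the `LAYOUT_*` block type strings come from the discussion above and the Textract API.

```typescript
// Toy model only: these types are hypothetical stand-ins, not the library's classes.
type ToyBlockType =
  | "LAYOUT_HEADER"
  | "LAYOUT_FOOTER"
  | "LAYOUT_PAGE_NUMBER"
  | "LAYOUT_TITLE"
  | "LAYOUT_TEXT";

interface ToyLayoutItem {
  blockType: ToyBlockType;
  text: string;
}

interface ToyHtmlOpts {
  // Block types to leave out of the rendered output.
  skipBlockTypes?: ToyBlockType[];
}

// Assumed tag mapping, chosen only to keep the example short.
const TAGS: Record<ToyBlockType, string> = {
  LAYOUT_HEADER: "header",
  LAYOUT_FOOTER: "footer",
  LAYOUT_PAGE_NUMBER: "small",
  LAYOUT_TITLE: "h1",
  LAYOUT_TEXT: "p",
};

function escapeHtml(text: string): string {
  return text.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

function renderPageHtml(items: ToyLayoutItem[], opts: ToyHtmlOpts = {}): string {
  const skip = new Set<ToyBlockType>(opts.skipBlockTypes ?? []);
  return items
    .filter((item) => !skip.has(item.blockType)) // drop unwanted block types
    .map((item) => `<${TAGS[item.blockType]}>${escapeHtml(item.text)}</${TAGS[item.blockType]}>`)
    .join("\n");
}

const sampleItems: ToyLayoutItem[] = [
  { blockType: "LAYOUT_HEADER", text: "ACME Corp" },
  { blockType: "LAYOUT_TITLE", text: "Report" },
  { blockType: "LAYOUT_TEXT", text: "Body text." },
  { blockType: "LAYOUT_FOOTER", text: "Confidential" },
  { blockType: "LAYOUT_PAGE_NUMBER", text: "3" },
];

// Header and footer are dropped; the page number stays unless it is listed too.
console.log(renderPageHtml(sampleItems, { skipBlockTypes: ["LAYOUT_HEADER", "LAYOUT_FOOTER"] }));
```

In this sketch, skipping page numbers as well would just mean adding `"LAYOUT_PAGE_NUMBER"` to the list. Whether the real option accepts every block type is for the library's docs to confirm.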

rsanaie commented 3 months ago

IBlockTypeFilterOpts works for me; I don't need to pick out specific IDs.

athewsey commented 3 months ago

OK so the good news is I've been able to get a scrappy v0.4.2-alpha.1 pre-release out already where the above should work...

...But the bad news is there's probably a fair bit more to figure out & harden before it could go to mainline release. Today the filter options on html() only work properly with Layout* block types, and only the Layout* items (plus Page and TextractDocument) support passing the options in. I'd like to make a more general extension to enable full IBlockTypeFilterOpts across all IRenderables, but that'll probably take a while to work through.
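One possible reading of this limitation, sketched as a toy model: the filter is applied where a page iterates over its Layout items, while lower-level renderers take no options at all. Everything below (`ToyFilterOpts`, `ToyLine`, `ToyLayout`, `toyPageHtml` and the helper functions) is hypothetical and only illustrates the idea. The library's actual internals may well differ.

```typescript
// Toy model only: hypothetical types and functions, not the library's internals.
interface ToyFilterOpts {
  skipBlockTypes?: string[];
}

interface ToyLine {
  blockType: "LINE";
  text: string;
}

interface ToyLayout {
  blockType: string; // e.g. "LAYOUT_TEXT", "LAYOUT_FOOTER"
  lines: ToyLine[];
}

// Line-level renderer: takes no options, so nothing can be filtered at this level.
function toyLineHtml(line: ToyLine): string {
  return line.text;
}

// Layout-level renderer: accepts options, but in this model never inspects its children.
function toyLayoutHtml(item: ToyLayout, _opts: ToyFilterOpts = {}): string {
  return `<p>${item.lines.map(toyLineHtml).join(" ")}</p>`;
}

// Page-level renderer: the only place where skipBlockTypes is actually applied.
function toyPageHtml(items: ToyLayout[], opts: ToyFilterOpts = {}): string {
  const skip = new Set(opts.skipBlockTypes ?? []);
  return items
    .filter((item) => !skip.has(item.blockType))
    .map((item) => toyLayoutHtml(item, opts))
    .join("\n");
}

const toyPage: ToyLayout[] = [
  {
    blockType: "LAYOUT_TEXT",
    lines: [
      { blockType: "LINE", text: "Hello" },
      { blockType: "LINE", text: "world" },
    ],
  },
  { blockType: "LAYOUT_FOOTER", lines: [{ blockType: "LINE", text: "Page footer" }] },
];

// Skipping a Layout type removes that item; skipping "LINE" changes nothing here.
console.log(toyPageHtml(toyPage, { skipBlockTypes: ["LAYOUT_FOOTER"] }));
console.log(toyPageHtml(toyPage, { skipBlockTypes: ["LINE"] }));
```

Under this model, the "more general extension" would mean passing the options through every renderer and checking them at each level, which is why it could take longer to get right.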

If you manage to try out the alpha and have any feedback though, it'd be great to hear! Maybe it can enable your use-case in the short term at least.

athewsey commented 2 months ago

Hi @rsanaie - I just pushed v0.4.2-alpha.3, which I think should work functionally pretty much the same as the last one but with less ugliness under-the-hood.

Any chance you'd have some time to try it out and double-check it doesn't break anything before we go ahead and push to a mainline release?

athewsey commented 2 months ago

v0.4.2 is now released so closing this issue. Thanks for raising!