issues
Unstructured-IO/unstructured
Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0
7.4k stars, 573 forks
bug(html): <div> with both text and phrasing child breaks element at phrasing child
#3228
scanny
opened
1 week ago
0
bug(html): <br/> element breaks paragraph/document-element
#3227
scanny
opened
1 week ago
0
null <- Ingest test fixtures update
#3226
ryannikolaidis
closed
1 week ago
0
refactor: remove download packages step
#3225
christinestraub
closed
1 week ago
0
Roman/migrate es dest
#3224
rbiseck3
closed
1 week ago
0
KDB.AI as destination connector
#3223
alexgiannak
opened
1 week ago
1
Roman/fix ingest async connectors <- Ingest test fixtures update
#3222
ryannikolaidis
closed
1 week ago
0
DNM: Clone of 3108
#3221
MthwRobinson
closed
1 week ago
0
build(deps): version bumps for 2024-06-17
#3220
MthwRobinson
closed
2 weeks ago
0
null <- Ingest test fixtures update
#3219
ryannikolaidis
closed
1 week ago
0
rfctr(html): replace html parser
#3218
scanny
opened
2 weeks ago
6
Unstructured data extraction using Hi_res method takes 1 hour to extract data from a 700 pages pdf.
#3217
uprg
closed
2 weeks ago
3
Images in headers
#3216
famda
closed
57 minutes ago
1
rfctr(html): drop convert_and_partition_html()
#3215
scanny
closed
1 week ago
0
rfctr: chroma destination migrated to V2
#3214
potter-potter
closed
1 week ago
0
build: pull from wolfi base image
#3213
MthwRobinson
closed
2 weeks ago
0
FEAT: Astra DB Source Connector Support
#3212
erichare
opened
2 weeks ago
8
Roman/fix ingest async connectors <- Ingest test fixtures update
#3211
ryannikolaidis
closed
2 weeks ago
0
Roman/fix ingest async connectors
#3210
rbiseck3
closed
1 week ago
0
FIX: The <div> text element with one <br> will not be regarded as a text element by `_is_text_tag`
#3209
heya5
opened
2 weeks ago
0
Loading PDF for parsing takes about 2 minutes to complete. How can we speed up the process
#3208
shuaihutianxie
closed
57 minutes ago
1
rfctr(html): drop HTML-specific elements
#3207
scanny
closed
2 weeks ago
0
chore: bump unstructured-inference 0.7.35 <- Ingest test fixtures update
#3206
ryannikolaidis
closed
2 weeks ago
0
chore: bump unstructured-inference 0.7.35
#3205
christinestraub
closed
2 weeks ago
0
Feat/migrate elasticsearch src connector <- Ingest test fixtures update
#3204
ryannikolaidis
closed
2 weeks ago
0
Roman/dry ingest pipeline step
#3203
rbiseck3
closed
2 weeks ago
1
bug/<ocr-agent> call PartitionPdf error: no ocr_agent found
#3202
ZephryLiang
opened
2 weeks ago
6
fix: disable docker build cache
#3201
christinestraub
closed
2 weeks ago
0
ci: update to Node 20 actions
#3200
scanny
closed
2 weeks ago
0
feat/tqdm ingest support
#3199
rbiseck3
closed
2 weeks ago
3
feat/merge_tables_on_different pages
#3198
tanzeel291994
closed
26 minutes ago
1
build(deps): deltalake bump to `0.18.x`
#3197
MthwRobinson
closed
2 weeks ago
0
feat: table evaluations for fixed html table generation
#3196
pawel-kmiecik
closed
2 weeks ago
0
Feat/migrate elasticsearch src connector <- Ingest test fixtures update
#3195
ryannikolaidis
closed
2 weeks ago
0
feat/less strict Python version
#3193
egeres
closed
26 minutes ago
4
bug/unstructured.paddleocr is not compatible with GPU version of PaddleOCR
#3191
peixin-lin
opened
2 weeks ago
1
fix: dropbox source connector file path bugs <- Ingest test fixtures update
#3190
ryannikolaidis
closed
2 weeks ago
0
fix: dropbox source connector file path bugs
#3189
ryannikolaidis
closed
2 weeks ago
0
fix: relative path / permissions issues with v2 fsspec connectors <- Ingest test fixtures update
#3188
ryannikolaidis
closed
2 weeks ago
0
fix: relative path / permissions issues with v2 fsspec connectors
#3186
ryannikolaidis
closed
2 weeks ago
0
Feat/migrate elasticsearch src connector <- Ingest test fixtures update
#3185
ryannikolaidis
closed
2 weeks ago
0
rfctr(html): drop page concept
#3184
scanny
closed
2 weeks ago
0
Suggestion: include consolidated bounding box coordinates in chunk metadata when using "by_title" chunking strategy
#3194
NikitaKemarskiy
opened
2 weeks ago
1
fix: run `libreoffice` once during wolfi image build
#3183
MthwRobinson
closed
2 weeks ago
1
bug/pdf extraction error when strategy not set
#3187
pk-lit
closed
28 minutes ago
2
feat: skip ocr for certain element types (Issue #3163)
#3182
beez2022
opened
2 weeks ago
0
Unstructured paid API stuck again
#3181
Neel-132
closed
2 weeks ago
1
rfctr(html): break coupling to DocumentLayout
#3180
scanny
closed
2 weeks ago
1
FIX: Use proper Astra env vars name to match internal docs
#3179
erichare
closed
6 days ago
3
Sensitive data security issues
#3178
arkim822
closed
2 weeks ago
2