PlanTL-GOB-ES
/
corpus-cleaner
Generic toolkit for corpus cleaning
MIT License
5 stars
0 forks
issues
Bump redis from 3.4.1 to 4.4.4
#102
dependabot[bot]
opened
1 year ago
0
Bump redis from 3.4.1 to 4.5.3
#101
dependabot[bot]
closed
1 year ago
1
Bump protobuf from 3.20.0 to 3.20.2
#100
dependabot[bot]
closed
1 year ago
1
Bump joblib from 0.14.1 to 1.2.0
#99
dependabot[bot]
closed
1 year ago
0
Fix onion dedup branch merge to master
#98
ccasimiro88
closed
1 year ago
0
Bump numpy from 1.18.2 to 1.22.0
#97
dependabot[bot]
closed
1 year ago
0
Line breaks in biomedical corpus
#96
jordiae
opened
2 years ago
0
Update par_utils.py
#95
jordiae
closed
1 year ago
0
Master -> refactor
#94
jordiae
closed
1 year ago
0
Port corpus-cleaner to Distify
#93
jordiae
opened
2 years ago
0
Fix onion dedup
#92
jordiae
closed
2 years ago
0
Dedup bne
#91
jordiae
closed
2 years ago
0
Free nodes when they have finished cleaning and rebalance the load
#90
asier-gutierrez
opened
3 years ago
0
BNE Gzip parsing is wrong and dramatically slows the execution
#89
asier-gutierrez
opened
3 years ago
0
Verify (and fix) potential CPU overloading
#88
asier-gutierrez
opened
3 years ago
0
Master node's RAM allocation is larger
#87
asier-gutierrez
opened
3 years ago
0
RAM Memory Leak or Unexpected Memory Allocation
#86
asier-gutierrez
opened
3 years ago
0
BSC Crawl data parser: url = url & keywords = url?
#85
asier-gutierrez
opened
3 years ago
0
Error: "Processed Expecting value: line 1..." when processing large corpus.
#84
asier-gutierrez
opened
3 years ago
0
Check "Processed None(?) into OutputFormatterMapper" message
#83
asier-gutierrez
opened
3 years ago
0
Tackle distributed computing limitation with 50<x<600 nodes
#82
asier-gutierrez
opened
3 years ago
1
Accept PDF format
#81
onadegibert
opened
3 years ago
0
Refactoring roadmap
#80
jordiae
opened
3 years ago
0
None filter not working
#79
asier-gutierrez
opened
3 years ago
1
no filter
#78
jordiae
closed
3 years ago
0
OOM error while processing big file (150GB)
#77
onadegibert
opened
3 years ago
0
Roadmap v2
#76
jordiae
opened
3 years ago
1
Rethink deduplication
#75
jordiae
opened
3 years ago
0
Add concat.py script
#74
jordiae
closed
3 years ago
1
Fix new deploy: few onions and slow speed
#73
onadegibert
opened
3 years ago
0
New cleaner
#72
onadegibert
closed
3 years ago
0
Fixed checkpoint
#71
jordiae
closed
3 years ago
0
Added timeout to encoding guessing
#70
jordiae
closed
3 years ago
0
New checkpointing
#69
jordiae
closed
3 years ago
0
remove format-control letters before applying the global sentence deduplication
#68
ccasimiro88
closed
3 years ago
0
fix incorrect ";"
#67
ccasimiro88
closed
3 years ago
0
prevent string format-control errors during global sentence deduplication
#66
ccasimiro88
closed
3 years ago
0
Warnings
#65
ccasimiro88
closed
3 years ago
0
Fixed global dedup
#64
ccasimiro88
closed
3 years ago
0
fix deduplication counting only the sentences in non duplicated documents
#63
ccasimiro88
closed
3 years ago
0
fix global sentence deduplication
#62
ccasimiro88
closed
3 years ago
0
Progress logging
#61
ccasimiro88
closed
3 years ago
0
Debug mode
#60
ccasimiro88
closed
3 years ago
0
implement FileParser class for file input format
#59
ccasimiro88
closed
3 years ago
0
Debug cleaning errors
#58
ccasimiro88
closed
3 years ago
0
Fix apostrophes ca
#57
onadegibert
closed
3 years ago
0
Random sampling
#56
ccasimiro88
closed
3 years ago
0
Fixed symlinks
#55
jordiae
closed
3 years ago
0
Only sentence splitting
#54
jordiae
closed
3 years ago
0
Buffered read
#53
jordiae
closed
3 years ago
0