JuliaText / WordTokenizers.jl

High performance tokenizers for natural language processing and other related tasks

Adding GPT2 Tokenizer for WordTokenizers' Pretrained tokenizers #61

Open shikhargoswami opened 3 years ago

shikhargoswami commented 3 years ago

Hello everyone. This PR adds a GPT2 tokenizer, extending the pretrained tokenizers in WordTokenizers.jl. It might be helpful in the future for developing an end-to-end pipeline on top of the GPT2 model in Julia. I have added tests, but suggestions and corrections would be helpful :)

shikhargoswami commented 3 years ago

I don't know why the build fails with this error on julia_version=1.1. @aviks @Ayushk4 @oxinabox, help needed.

Testing WordTokenizers
 Resolving package versions...
ERROR: Unsatisfiable requirements detected for package MbedTLS [739be429]:
 MbedTLS [739be429] log:
 ├─possible versions are: [0.5.13-0.5.14, 0.6.0-0.6.8, 0.7.0, 1.0.0-1.0.3] or uninstalled
 ├─restricted to versions 1.0.3 by an explicit requirement, leaving only versions 1.0.3
 └─restricted by julia compatibility requirements to versions: [0.5.13-0.5.14, 0.6.0-0.6.8] or uninstalled — no versions left
Stacktrace:
 [1] #propagate_constraints!#61(::Bool, ::Function, ::Pkg.GraphType.Graph, ::Set{Int32}) at C:\cygwin\home\Administrator\buildbot\worker\package_win32\build\usr\share\julia\stdlib\v1.1\Pkg\src\GraphType.jl:1007
 [2] propagate_constraints! at C:\cygwin\home\Administrator\buildbot\worker\package_win32\build\usr\share\julia\stdlib\v1.1\Pkg\src\GraphType.jl:948 [inlined]
 [3] #simplify_graph!#121(::Bool, ::Function, ::Pkg.GraphType.Graph, ::Set{Int32}) at C:\cygwin\home\Administrator\buildbot\worker\package_win32\build\usr\share\julia\stdlib\v1.1\Pkg\src\GraphType.jl:1462
 [4] simplify_graph! at C:\cygwin\home\Administrator\buildbot\worker\package_win32\build\usr\share\julia\stdlib\v1.1\Pkg\src\GraphType.jl:1462 [inlined] (repeats 2 times)
 [5] resolve_versions!(::Pkg.Types.Context, ::Array{Pkg.Types.PackageSpec,1}, ::Nothing) at C:\cygwin\home\Administrator\buildbot\worker\package_win32\build\usr\share\julia\stdlib\v1.1\Pkg\src\Operations.jl:371
 [6] resolve_versions! at C:\cygwin\home\Administrator\buildbot\worker\package_win32\build\usr\share\julia\stdlib\v1.1\Pkg\src\Operations.jl:315 [inlined]
 [7] #add_or_develop#63(::Array{Base.UUID,1}, ::Symbol, ::Function, ::Pkg.Types.Context, ::Array{Pkg.Types.PackageSpec,1}) at C:\cygwin\home\Administrator\buildbot\worker\package_win32\build\usr\share\julia\stdlib\v1.1\Pkg\src\Operations.jl:1172
 [8] add_or_develop at C:\cygwin\home\Administrator\buildbot\worker\package_win32\build\usr\share\julia\stdlib\v1.1\Pkg\src\Operations.jl:1156 [inlined]
 [9] (::getfield(Pkg.Operations, Symbol("##40#44")){Bool,getfield(Pkg.Operations, Symbol("##68#70")){Pkg.Types.Context,getfield(Pkg.Operations, Symbol("##67#69")){Pkg.Types.Context,Cmd}},Pkg.Types.Context,Pkg.Types.PackageSpec,Pkg.Types.Context})(::String) at C:\cygwin\home\Administrator\buildbot\worker\package_win32\build\usr\share\julia\stdlib\v1.1\Pkg\src\Operations.jl:874
 [10] mktempdir(::getfield(Pkg.Operations, Symbol("##40#44")){Bool,getfield(Pkg.Operations, Symbol("##68#70")){Pkg.Types.Context,getfield(Pkg.Operations, Symbol("##67#69")){Pkg.Types.Context,Cmd}},Pkg.Types.Context,Pkg.Types.PackageSpec,Pkg.Types.Context}, ::String) at .\file.jl:581
 [11] mktempdir at .\file.jl:579 [inlined]
 [12] #with_dependencies_loadable_at_toplevel#38(::Bool, ::Function, ::getfield(Pkg.Operations, Symbol("##68#70")){Pkg.Types.Context,getfield(Pkg.Operations, Symbol("##67#69")){Pkg.Types.Context,Cmd}}, ::Pkg.Types.Context, ::Pkg.Types.PackageSpec) at C:\cygwin\home\Administrator\buildbot\worker\package_win32\build\usr\share\julia\stdlib\v1.1\Pkg\src\Operations.jl:853
 [13] #with_dependencies_loadable_at_toplevel at .\none:0 [inlined]
 [14] #test#66(::Bool, ::Function, ::Pkg.Types.Context, ::Array{Pkg.Types.PackageSpec,1}) at C:\cygwin\home\Administrator\buildbot\worker\package_win32\build\usr\share\julia\stdlib\v1.1\Pkg\src\Operations.jl:1319
 [15] #test at C:\cygwin\home\Administrator\buildbot\worker\package_win32\build\usr\share\julia\stdlib\v1.1\Pkg\src\API.jl:0 [inlined]
 [16] #test#46(::Bool, ::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::Function, ::Pkg.Types.Context, ::Array{Pkg.Types.PackageSpec,1}) at C:\cygwin\home\Administrator\buildbot\worker\package_win32\build\usr\share\julia\stdlib\v1.1\Pkg\src\API.jl:198
 [17] #test at .\none:0 [inlined]
 [18] #test#45 at C:\cygwin\home\Administrator\buildbot\worker\package_win32\build\usr\share\julia\stdlib\v1.1\Pkg\src\API.jl:180 [inlined]
 [19] #test at .\none:0 [inlined]
 [20] #test#42 at C:\cygwin\home\Administrator\buildbot\worker\package_win32\build\usr\share\julia\stdlib\v1.1\Pkg\src\API.jl:177 [inlined]
 [21] (::getfield(Pkg.API, Symbol("#kw##test")))(::NamedTuple{(:coverage,),Tuple{Bool}}, ::typeof(Pkg.API.test)) at .\none:0
 [22] top-level scope at none:0
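
For anyone reading the resolver log above, here is a minimal sketch of why it ends with "no versions left". It is plain Python set arithmetic, not Pkg's actual resolver, and it assumes each range in the log (for example 0.6.0-0.6.8) covers every patch release in between, which the registry may not actually contain. The `expand` helper and the variable names are made up for illustration.

```python
# Illustration only: restates the MbedTLS constraints from the log as sets.
# Assumption: each "a.b.x-a.b.y" range contains every patch number x..y.

def expand(prefix, lo, hi):
    """Return the version strings prefix.lo .. prefix.hi as a set."""
    return {f"{prefix}.{patch}" for patch in range(lo, hi + 1)}

# "possible versions are: [0.5.13-0.5.14, 0.6.0-0.6.8, 0.7.0, 1.0.0-1.0.3]"
possible = (
    expand("0.5", 13, 14)
    | expand("0.6", 0, 8)
    | {"0.7.0"}
    | expand("1.0", 0, 3)
)

# "restricted to versions 1.0.3 by an explicit requirement"
explicit = {"1.0.3"}

# "restricted by julia compatibility requirements to versions:
#  [0.5.13-0.5.14, 0.6.0-0.6.8]"
julia_compat = expand("0.5", 13, 14) | expand("0.6", 0, 8)

# A version must satisfy every constraint at once, so intersect the sets.
after_explicit = possible & explicit
remaining = after_explicit & julia_compat

print("after explicit requirement:", sorted(after_explicit))
print("after julia compat:", sorted(remaining))
```

The second intersection is empty because 1.0.3 is not among the versions allowed on Julia 1.1. This only restates the log. It does not say where the explicit 1.0.3 requirement comes from.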