FluxML / FastAI.jl

Repository of best practices for deep learning in Julia, inspired by fastai
https://fluxml.ai/FastAI.jl
MIT License

Textmodel integration #250

Open Chandu-4444 opened 2 years ago

Chandu-4444 commented 2 years ago

The following now works:

lm = FastText.LanguageModel(true)
classifier = FastText.TextClassifier(lm)
FastText.train_classifier!(classifier) # Would throw an error as I haven't fully enabled the model to work with FastAI's data container.
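For context, FastAI.jl data containers build on the MLUtils.jl observation interface, so hooking the model up to a data container mostly means making the data respond to `numobs`/`getobs`. A minimal sketch with a hypothetical in-memory container (names and fields are illustrative, not the actual integration):

```julia
using MLUtils

# Hypothetical container of (text, label) pairs, for illustration only.
struct TextContainer
    texts::Vector{String}
    labels::Vector{String}
end

# Implementing these two methods is what makes a type usable as a
# FastAI.jl-style data container.
MLUtils.numobs(c::TextContainer) = length(c.texts)
MLUtils.getobs(c::TextContainer, i::Int) = (c.texts[i], c.labels[i])

data = TextContainer(["great movie", "terrible plot"], ["pos", "neg"])
getobs(data, 1)  # ("great movie", "pos")
```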
lorenzoh commented 2 years ago

I think you've readded some files from TextModels.jl that we don't need, could you remove those? 🙂

Chandu-4444 commented 2 years ago

Sure! Will clean up in the next commit.

Chandu-4444 commented 2 years ago
julia> batches = FastText.load_batchseq(data, task)
julia> batches[1][1]
92-element Vector{Vector{Int64}}:
 [25000, 25000, 25000, 25000, 25000, 25000, 25000, 25000]
 [633779, 633779, 633779, 633779, 633779, 633779, 633779, 633779]
 [2731, 34, 315, 354, 2087, 2209, 70, 1307]
 [44047, 435, 633779, 633779, 6589, 633779, 633779, 205]
 ⋮
 [0, 0, 0, 0, 0, 213, 0, 0]
 [0, 0, 0, 0, 0, 25, 0, 0]
 [0, 0, 0, 0, 0, 1778, 0, 0]

julia> batches[1][2]
8-element Vector{Int64}:
 1
 1
 1
 1
 1
 0
 1
 1
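The batch layout above is timestep-major: each inner vector holds one token position across the whole batch of 8 sequences, with shorter sequences padded out (the trailing rows of zeros). A pure-Julia sketch of that layout, assuming 0 is the pad id:

```julia
# Pad a set of variable-length sequences and transpose them so that
# batch[t] contains the t-th token of every sequence (timestep-major),
# as in the `load_batchseq` output above. Pad id 0 is an assumption here.
function batch_timestep_major(seqs::Vector{Vector{Int}}, pad::Int=0)
    maxlen = maximum(length, seqs)
    [[t <= length(s) ? s[t] : pad for s in seqs] for t in 1:maxlen]
end

seqs = [[25000, 2731, 44047], [25000, 34], [25000, 315, 435, 213]]
batch = batch_timestep_major(seqs)
# batch[1] == [25000, 25000, 25000]  (first token of every sequence)
# batch[4] == [0, 0, 213]            (pad where sequences have ended)
```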
Chandu-4444 commented 2 years ago

Next:

ToucheSir commented 2 years ago

Should the vocab CSV files be checked in? I would've assumed they would be artifacts or DataDeps as well.
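Registering the vocab file as a DataDep would look roughly like this (a sketch; the name, description, URL, and checksum below are placeholders, not the real artifact):

```julia
using DataDeps

# Hypothetical registration; every string here is a placeholder.
register(DataDep(
    "FastText-IMDB-vocab",
    "Vocabulary CSV for the FastText language model (placeholder description).",
    "https://example.com/vocab.csv",  # placeholder URL
    "0000000000000000000000000000000000000000000000000000000000000000",  # placeholder SHA-256
))

# On first use, DataDeps downloads and caches the file:
# vocab_path = joinpath(datadep"FastText-IMDB-vocab", "vocab.csv")
```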

Chandu-4444 commented 2 years ago
julia> data, blocks = load(datarecipes()["imdb"])
((mapobs(loadfile, ObsView(::MLDatasets.FileDataset{typeof(identity), String}, ::Vector{Int64})), mapobs(parentname, ObsView(::MLDatasets.FileDataset{typeof(identity), String}, ::Vector{Int64}))), (Paragraph(), Label{String}(["neg", "pos"])))

julia> task = TextClassificationSingle(blocks, data)
SupervisedTask(Paragraph -> Label{String})

julia> model = FastAI.taskmodel(task, FastText.LanguageModel(false, task))
#90 (generic function with 1 method)

julia> batches = FastText.load_batchseq(data, task)
WARNING: both Losses and NNlib export "ctc_loss"; uses of it in module Flux must be qualified
6250-element Vector{Tuple{Vector{Vector{Int64}}, Flux.OneHotArray{UInt32, 2, 1, 2, Vector{UInt32}}}}:
 ([[35, 35, 35, 35], [3, 3, 3, 9], [40, 18025, 15, 14], [224, 10, 3541, 3040], [737, 34, 24, 505], [49, 7, 809, 3], [4, 4, 221, 3836], [1927, 104, 4, 3], [7, 16, 629, 28440], [6, 351, 7, 17]  …  [2, 2, 2, 44], [2, 2, 2, 3], [2, 2, 2, 9839], [2, 2, 2, 17], [2, 2, 2, 1041], [2, 2, 2, 27], [2, 2, 2, 3], [2, 2, 2, 3836], [2, 2, 2, 3], [2, 2, 2, 28440]], [0 0 1 1; 1 1 0 0])

julia> using FluxTraining

julia> td, vd = splitobs(batches, at=0.9)

julia> using Flux

julia> learner = Learner(model, Flux.Losses.logitcrossentropy, callbacks=[Metrics(accuracy)]; data=(td, vd))
Learner()

julia> fit!(learner, 1)
Epoch 1 TrainingPhase():   0%|█                                                                                               |  ETA: 4 days, 3:35:31
Chandu-4444 commented 1 year ago

The changes have been merged to #258.