dotnet / machinelearning-samples

Samples for ML.NET, an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml
MIT License

Score of NaN in Product Recommendation #620

Open NatElkins opened 5 years ago

NatElkins commented 5 years ago

Hi,

I am trying to adapt the F# product recommendation sample to a small set of data that I have.

There are three main differences between my data and the Amazon data set:

  1. My data set is much smaller, only about 78K entries.
  2. I'm using UserId and ProductId, instead of ProductId and CoPurchasedProductId.
  3. My ProductId is a string, rather than something that could be cast as a UInt32.

My issue is that I can only get a score of NaN for any prediction. I haven't found a way to diagnose this. Maybe someone here can help?

Here's my code:

open Microsoft.ML
open Microsoft.ML.Data
open System
open Microsoft.ML.Trainers

[<CLIMutable>]
type ProductEntry = 
    {
        [<LoadColumn(0); KeyType(count=6248UL)>]
        UserId : uint32
        [<LoadColumn(1)>]
        ProductId : string
        [<NoColumn>]
        Label : float32
    }

[<CLIMutable>]
type Prediction = {Score : float32}

let trainDataPath = "/Users/nat/Downloads/user_product_prediction.csv"

let mlContext = MLContext()

let options = MatrixFactorizationTrainer.Options(MatrixColumnIndexColumnName = "UserIdEncoded", 
                                                 MatrixRowIndexColumnName = "ProductIdEncoded",
                                                 LossFunction = MatrixFactorizationTrainer.LossFunctionType.SquareLossOneClass,
                                                 LabelColumnName = "Label",
                                                 Alpha = 0.01,
                                                 Lambda = 0.025)

let matrixFactorizationTrainer = mlContext.Recommendation().Trainers.MatrixFactorization(options)

let pipeline = 
    EstimatorChain().Append(
        mlContext.Transforms.Conversion
            .MapValueToKey(inputColumnName="UserId",outputColumnName="UserIdEncoded"))
        .Append(
            mlContext.Transforms.Conversion
                .MapValueToKey(inputColumnName="ProductId",outputColumnName="ProductIdEncoded"))
        .Append(matrixFactorizationTrainer)

let traindata =
    let columns = 
        [|
            TextLoader.Column("Label", DataKind.Single, 0)
            TextLoader.Column("UserId", DataKind.UInt32, source = [|TextLoader.Range(0)|], keyCount = KeyCount 6248UL) 
            TextLoader.Column("ProductId", DataKind.String, source = [|TextLoader.Range(1)|]) 
        |]
    mlContext.Data.LoadFromTextFile(trainDataPath, columns, hasHeader=false, separatorChar=',')

let model = pipeline.Fit(traindata)

let predictionengine = mlContext.Model.CreatePredictionEngine<ProductEntry, Prediction>(model)

let productEntry = {ProductId = "farfetch-13164877"; UserId = (uint32 10650); Label = 0.f}

let prediction = predictionengine.Predict productEntry

printfn ""
printfn "For product entry %A the predicted score is %f" productEntry prediction.Score
printf "=============== End of process, hit any key to finish ==============="
Console.ReadKey() |> ignore

Even when selecting a UserId-ProductId combination that is in the file, I get a score of NaN.

So, I have a few questions:

  1. Am I mapping the ProductId string to a key correctly?
  2. What are some possible causes of the NaN response?
  3. I notice in the sample that the fields of the records have LoadColumn attributes, but the columns are also explicitly defined in the TextLoader. Why is that?

My data set can be found here: user_product_prediction.txt

To try it out, just replace the Program.fs in the matrix factorization product recommendation F# sample with the program I pasted above. Download the txt file and change the path to the file as appropriate.

Any help or a pointer in the right direction would be deeply appreciated!

Thanks!

kevmal commented 5 years ago

On point (3), you don't need both. If I recall correctly, the samples were at some point migrated from explicit column definitions to the attribute-based approach, but issues at the time meant this sample didn't get updated. The attributes just happen to remain and could be removed.

The main issue here is that UserId is being loaded as a Key with a count of 6248UL. Looking at your data, UserId doesn't fit that definition: the values are arbitrary ids rather than indices into a range of 6248 values (your own example uses 10650). So it's good that it gets mapped to a key in the pipeline, but it doesn't make sense to load it as a key. Try updating as follows:

[<CLIMutable>]
type ProductEntry = 
    {
        UserId : uint32
        ProductId : string
        Label : float32
    }

and


let traindata =
    let columns = 
        [|
            TextLoader.Column("Label", DataKind.Single, 0)
            TextLoader.Column("UserId", DataKind.UInt32, source = [|TextLoader.Range(0)|]) 
            TextLoader.Column("ProductId", DataKind.String, source = [|TextLoader.Range(1)|]) 
        |]
    mlContext.Data.LoadFromTextFile(trainDataPath, columns, hasHeader=false, separatorChar=',')
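
To make the key-count mismatch concrete, here is a minimal sketch in plain F# (no ML.NET calls). The ids are hypothetical and not taken from the actual file, and the range check assumes that a key of count N covers N contiguous values; the exact offset ML.NET uses may differ, so treat the `id < count` test as an illustration only.

```fsharp
// Hypothetical raw ids for illustration only; not taken from the real data set.
let rawUserIds = [ 3u; 17u; 10650u; 17u; 42u ]

// Number of distinct ids: the figure one might be tempted to use as the key count.
let distinctCount = rawUserIds |> List.distinct |> List.length

// Largest raw id: what a key count declared at load time would have to cover.
let maxId = List.max rawUserIds

// ASSUMPTION: a key of the given count covers `count` contiguous values,
// modelled here as 0 .. count-1. Anything else is treated as out of range.
let fitsKey (count: uint32) (id: uint32) = id < count

// Ids that a key count equal to the number of distinct ids could not represent.
let outOfRange =
    rawUserIds
    |> List.filter (fitsKey (uint32 distinctCount) >> not)
    |> List.distinct

printfn "distinct ids: %d, largest id: %d" distinctCount maxId
printfn "ids outside the assumed key range: %A" outOfRange
```

The point of the sketch is that the number of distinct ids and the range of raw id values are different things. MapValueToKey builds the mapping from whatever values it sees, which is why dropping the key count at load time and relying on the mapping step works.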

NatElkins commented 5 years ago

@kevmal You're right, removing the label and key counts results in an actual score. Thank you (note to MS devs, an error or warning log might be helpful here)!!!

I have a few other questions, maybe you can help answer them:

  1. There are 6248 unique user ids in the file. Is that not what KeyCount represents?
  2. If KeyCount can be determined automatically, why would anyone fill it in explicitly?
  3. Can you confirm my understanding of the "score"? My understanding is that the number produced itself is meaningless (if it does actually represent something please do let me know), and that you can only derive meaning by ranking a particular UserId-ProductId combo against other UserId-ProductId combos.
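
The ranking reading of the score in question 3 could be sketched like this in plain F#. The product ids and score values are made up for illustration, and dropping NaN scores before sorting is an assumption about how one might handle them, not something the sample prescribes. Whether the raw value means anything beyond relative order is left open here.

```fsharp
open System

// Hypothetical (productId, score) pairs for a single user; the values are made up.
let scored =
    [ "product-a", 0.12f
      "product-b", Single.NaN
      "product-c", 0.87f
      "product-d", 0.45f ]

// Drop NaN scores (they cannot be ordered meaningfully), then sort best-first.
let ranked =
    scored
    |> List.filter (fun (_, score) -> not (Single.IsNaN score))
    |> List.sortByDescending snd

// Print the candidates in rank order.
ranked
|> List.iteri (fun i (productId, score) ->
    printfn "%d. %s (%.2f)" (i + 1) productId score)
```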

Thanks again for your help, I really appreciate it!