Remarks - Githubissues

tomseimandi commented 4 months ago

Intro: Insee -> readers won't necessarily know what it is?
2.1: "interestingly, subsequent projects involving large datasets didn’t suffer much from this change, as their needs were actually very much in line with Tigani’s observations: the performance bottleneck for these projects was generally on the side of computational needs rather than storage capacity, making Hadoop-style clusters less relevant". I don't quite follow this argument; I may be missing something.
2.3: "Through object storage, users gain control over the storage layer, allowing them to experiment with diverse datasets without being constrained by the limited storage spaces typically allocated by IT departments" -> the argument here is that storage space can easily be increased to fit needs, right? To clarify, it might help to spell out why IT departments allocate "limited storage spaces typically".
3.1: "stored in Vault" -> does this need a link or further detail?
MLFlow -> MLflow ?
3.4: Vault is mentioned again without any additional information.
4.1.3: "however, such architectures impose greater demands on the production setting, as they are much larger and often require specific hardware such as GPUs to be fine-tuned and perform inference with acceptable latency". Not 100% convinced that fine-tuning belongs in this sentence: in principle it happens outside production, doesn't it? For inference the point is arguably acceptable, although with the volumes of the Sirene use case I don't think the argument really holds.
4.1.3: "The fastText model relies on a bag-of-words model to obtain embeddings and a classification layer based on logistic regression. The bag-of-words approach involves representing a text as the set of vector representations of each of its constituent tokens" -> "...each of its constituent words" instead? "Tokens" is too generic. fastText is precisely a bag of n-grams, where tokens = n-grams.
"The specificity of the fastText model compared to other embeddings-based approaches is that embeddings are not only computed on words but also on word n-grams and character n-grams, providing more context and reducing biases due to spelling mistakes" -> I would move this right after the sentence above, and add a definition of token = words + n-grams so that the next sentence can be modified.
"With FastText, the embedding of a sentence is computed as a function of the individual embeddings, typically the average" -> "The embedding of a sentence is computed as the average of the individual token embeddings" (with "token" defined just above).
Maybe change at least the caption of Fig. 8: the feature extraction or tokenization step is not really part of the model. Actually, the figure may be altogether superfluous?
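To make the proposed definition concrete (token = words + word n-grams + character n-grams, sentence embedding = average of the token embeddings), here is a minimal sketch in plain Python. It is not fastText's actual implementation: the names `char_ngrams`, `tokenize` and `ToyEmbedder` are hypothetical, the vectors are random rather than trained, and tokens are stored in a dictionary, whereas the real library hashes n-grams into a fixed number of buckets.

```python
import random


def char_ngrams(word, n_min=3, n_max=4):
    """Character n-grams of a word padded with boundary markers '<' and '>'."""
    padded = f"<{word}>"
    return [
        padded[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(padded) - n + 1)
    ]


def tokenize(text, word_ngrams=2):
    """Tokens = words + character n-grams of each word + word n-grams."""
    words = text.lower().split()
    tokens = list(words)
    for word in words:
        tokens.extend(char_ngrams(word))
    for n in range(2, word_ngrams + 1):
        tokens.extend(" ".join(words[i:i + n]) for i in range(len(words) - n + 1))
    return tokens


class ToyEmbedder:
    """Illustrative stand-in: random, untrained vectors looked up per token."""

    def __init__(self, dim=8, seed=0):
        self.dim = dim
        self.rng = random.Random(seed)
        self.table = {}

    def vector(self, token):
        # Assumption: one vector per distinct token, created on first use.
        if token not in self.table:
            self.table[token] = [self.rng.uniform(-1, 1) for _ in range(self.dim)]
        return self.table[token]

    def sentence_embedding(self, text):
        """Average of the embeddings of all tokens extracted from the text."""
        tokens = tokenize(text)
        if not tokens:
            return [0.0] * self.dim
        vectors = [self.vector(t) for t in tokens]
        return [sum(column) / len(vectors) for column in zip(*vectors)]
```

In the actual model the averaged vector would then be fed to the classification layer. The sketch only shows why the tokenization step sits upstream of the model proper, which is the point raised about Fig. 8.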
4.2.2: "Monitoring is also an essential part of this process: a model deployed in production needs to be continuously assessed so as to detect data or concept drifts that may reduce the predictive performance of the model and thus necessitate further adjustments, such as re-training or fine-tuning the model." -> maybe rephrase slightly, because detecting data drift is not really an assessment of the model but rather of the inference data.
4.3.3: "As a result, data scientists are not fully autonomous when it comes to prototyping and testing updated versions of the model or the API, which limit the potential for continuous improvement.": one wouldn't test a new version of the model by redeploying the API, would one? Maybe just remove "the model or"?
4.3.5: "After several months of the first version of the model running in production, the need to build a gold-standard test set became increasingly apparent. First, such a set was not accessible at the time of the experimentation phase, so we relied on a subset of the training dataset to perform evaluation, knowing the labeling quality was not optimal. Collecting a gold-standard sample would thus enable us to get an unbiased view of the model’s performance in production on real data, particularly on data that has been automatically coded" -> I would like to bring in the idea that we want to collect evaluation data more or less continuously. "Gold standard" rather suggests a fixed evaluation set that does not evolve.
The preceding sentence is a bit confusing: it talks about "collection of new training data" before switching to the collection of an evaluation set.
The following sentence, "Another reason...": I find the wording a bit awkward because it switches back to training data.
"Against that background, an annotation campaign has been initiated in early 2024 to build the new training set.": an evaluation set first and foremost, rather? In short, for this subsection: maybe distinguish train and eval a bit more clearly.

tomseimandi commented 4 months ago

Honestly, this is great

tomseimandi commented 4 months ago

I spotted a few typos but surely missed some, see #4

avouacr commented 4 months ago

thanks :)

ThomasFaria / retex-innovation-insee

Remarks #3