JetBrains-Research / code2seq

PyTorch implementation of the code2seq model.
MIT License

Save model for prediction #115

Closed Avv22 closed 2 years ago

Avv22 commented 2 years ago

Thanks JetBrains team,

I would like to train the model first on a Python dataset and a Java dataset, each of about 20k files. Once trained, I would run prediction to get embeddings for both the Java and Python files. We would like one vector embedding (of size, say, 120) per file after training. Is that possible with your implementation, please?

Also, do you have a pre-trained model so we can avoid training altogether? We had trouble training the original code2seq model on a Python dataset: the tensors are very large and we got an OOM (out of memory) error. Our PC has 16 GB RAM and a 4 GB GPU, and the dataset we used was only around 1 GB for train and test combined, yet we still hit OOM. Can we run your model on this hardware, or do we need something bigger?

Thanks.

SpirinEgor commented 2 years ago

Hi!

I don't fully understand what kind of embedding you need. If you want to extract a vector representation of each function/file, I suggest you look at the Code2Class model. It is a combination of the encoder from code2seq and the decoder from code2vec. It is not the best interface, since it was developed for the classification problem and the output vector size corresponds to the number of classes in the vocabulary. But you can look through the code and create your own implementation that uses PathEncoder as the encoder and a custom MLP to aggregate path embeddings into a single vector.

Unfortunately, there are no weights for Python at all. And for Java, we have only for method name prediction, but it's not ready to be published yet. I hope I will have some time in the near future to add information about data, weights, reproducibility results.

From my experience, a 4GB GPU is quite small. The original implementation is quite sensitive to it; I wasn't able to run it on a GPU with less than 15GB of GPU RAM. This implementation may work for you, but you will need to set a small batch size and play with the model sizes.

Avv22 commented 2 years ago

@SpirinEgor.

Thank you. I did that yesterday: I edited the config.py file in the main code2seq repository and it worked. I did, however, change the shuffle size and other window parameters to limit how much data is loaded into RAM, and I am not sure whether that could affect quality. I will try the batch size as you said and see what happens.

For embeddings, for example after training the model on Java/Python, I would like to do prediction to get embeddings as follows:

[file1.java] --predict--> [embedding 1 for file 1]
[file2.java] --predict--> [embedding 2 for file 2]
...
[file20000.java] --predict--> [embedding 20000 for file 20000]

Same thing for Python:

[file1.py] --predict--> [embedding 1 for file 1]
[file2.py] --predict--> [embedding 2 for file 2]
...
[file20000.py] --predict--> [embedding 20000 for file 20000]

code2seq, however, does not give us the above embeddings; it only gives the method name, not a context vector embedding.

Edited: I realized that after we finish training code2seq, the prediction will simply be method names, one per file! So for our 20k files we would get only one method name per file. However, what we are looking for is a context vector that represents the whole file in our dataset. I am not sure what we can do with only a method name, since we need a vector that captures a global sense of the file for further processing, for example:

Same thing for Python:

[file1.py] --predict--> [embedding 1 (context vector) for file 1]
[file2.py] --predict--> [embedding 2 (context vector) for file 2]
...
[file20000.py] --predict--> [embedding 20000 (context vector) for file 20000]

Questions: so code2class is probably a better candidate in our case:

  1. Does code2class predict a context vector representing the whole class after training, or what output does it give specifically?
  2. I am not sure whether we can use your code2class on both Python and Java, or whether it only works with Java? We have 20k Python files and 20k Java files. code2seq just gives a method name, which I don't think is suitable, since we plan to do classification later and to feed the embedding vector into another network for further processing. For code2class, again, I have AST paths extracted and stored in .c2s format for both Python and Java.
  3. I guess I should first train code2class and then save the model for later prediction on my dataset, is that correct? Can you guide me on that if so?

Thank you.

SpirinEgor commented 2 years ago
  1. You can look through the Code2Class code and see that it is a combination of PathEncoder and Classifier. The classifier is a multi-layer perceptron with softmax at the end; the output size is equal to the size of the label_to_id vocabulary.

As I understand, you need vector representations. For that, you need a small change of Code2Class:

from omegaconf import DictConfig
from pytorch_lightning import LightningModule

from code2seq.data.vocabulary import Vocabulary
# PathEncoder and Classifier live inside the code2seq package;
# adjust the import path to wherever they are defined in the installed version.
from code2seq.model.modules import PathEncoder, Classifier


class Code2Vec(LightningModule):
    def __init__(self, model_config: DictConfig, optimizer_config: DictConfig, vocabulary: Vocabulary):
        super().__init__()
        self.save_hyperparameters()
        self._optim_config = optimizer_config

        self._encoder = PathEncoder(
            model_config,
            len(vocabulary.token_to_id),
            vocabulary.token_to_id[Vocabulary.PAD],
            len(vocabulary.node_to_id),
            vocabulary.node_to_id[Vocabulary.PAD],
        )
        # HERE I SET UP EMBEDDINGS SIZE
        # (add an `output_size` field to the model config and use it as the classifier output)
        self._classifier = Classifier(model_config, model_config.output_size)

You also need to adapt the loss function to properly train these embeddings. The _shared_step method shows how it is calculated for the classification task.
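For illustration, a rough sketch (not the library's exact interface) of what the aggregation could look like: mean-pool the per-path embeddings of each snippet and project them to the desired size. The tensor shapes and the contexts_per_label argument are assumptions about how the encoder output is batched.

import torch
from torch import nn


class PathAggregator(nn.Module):
    """Averages the path embeddings of one snippet into a single fixed-size vector."""

    def __init__(self, encoder_size: int, embedding_size: int):
        super().__init__()
        self._projection = nn.Linear(encoder_size, embedding_size)

    def forward(self, path_embeddings: torch.Tensor, contexts_per_label: torch.Tensor) -> torch.Tensor:
        # path_embeddings: [total number of paths in the batch; encoder size]
        # contexts_per_label: number of paths per snippet, e.g. [200, 180, ...]
        per_snippet = path_embeddings.split(contexts_per_label.tolist())
        pooled = torch.stack([paths.mean(dim=0) for paths in per_snippet])  # [batch size; encoder size]
        return self._projection(pooled)  # [batch size; embedding size]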

  2. The model doesn't care about language. You may fit as many different languages as you want. Feel free to experiment with it.

  3. I didn't fully understand what you mean here. You can train the model and validate it after each epoch; at the end, a checkpoint will be saved. You can use this checkpoint for further testing or inference.
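For example, since the models here are regular PyTorch Lightning modules, loading a saved checkpoint back for inference is the standard Lightning call (the import path and checkpoint name below are illustrative):

from code2seq.model import Code2Class  # adjust the import to where Code2Class lives in the installed package

model = Code2Class.load_from_checkpoint("path/to/checkpoint.ckpt")
model.eval()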

Avv22 commented 2 years ago

@SpirinEgor.

Your help is invaluable, thank you. So, I now have .c2s files for the Python dataset: running the code2seq team's extractor gave us train.c2s, test.c2s, val.c2s, and dict.c2s. How should I proceed to train your code2class model with the edit you provided above?

I already installed your library:

pip install code2seq

I mean, how should I start training and provide the data to your code2class model? And is it possible to reduce the model parameters so that it fits on my PC, since I have limited hardware (4 GB GPU and 16 GB RAM)?

SpirinEgor commented 2 years ago

It's better for you to look through code2seq_wrapper. There you can find model and data module initializations, and pass them into the training method.
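Roughly, the wrapper boils down to something like this (the data module import and constructor match the tracebacks later in this thread; the commented lines and where the vocabulary comes from are assumptions to adapt):

from omegaconf import OmegaConf
from code2seq.data.path_context_data_module import PathContextDataModule

config = OmegaConf.load("config/code2class-poj104.yaml")
data_module = PathContextDataModule(config.data_folder, config.data)
# model = Code2Class(config.model, config.optimizer, vocabulary)  # vocabulary is collected by the data module
# train(model, data_module, config)                               # from code2seq.utils.train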

To configure the model, i.e. set the batch size and model sizes, check the configuration files: configs
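For example, for a 4 GB GPU you could override values like these in the config (the numbers are illustrative, not tuned):

data:
  batch_size: 32        # down from 512
  test_batch_size: 32
  max_context: 200

model:
  embedding_size: 64    # down from 128
  encoder_rnn_size: 64
  classifier_size: 64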

Avv22 commented 2 years ago

@SpirinEgor.

You mean that since I am looking for context-vector output, I should train code2class and look at the code2class wrapper: https://github.com/JetBrains-Research/code2seq/blob/master/code2seq/code2class_wrapper.py.

Question: could you provide some steps for how to start training code2class and what data to pass to it? What data format should I provide to the code2class wrapper, i.e., should I pass the train.c2s, test.c2s, val.c2s, dict.c2s data or something else? I.e.,

python code2seq_wrapper.py train <DATA?>

You have specified that the train argument will start training the code2class model, but I am not sure what data I should pass in and in what format.

SpirinEgor commented 2 years ago

Yeah, you can look at that wrapper. Actually, they are quite the same.

The YAML config file declares the data folder (https://github.com/JetBrains-Research/code2seq/blob/master/config/code2class-poj104.yaml#L1). So, you can create a folder wherever you need, put all the c2s files in it, and set the path to it in the config.
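For example (the path is illustrative):

data_folder: /home/you/my-python-data   # folder containing train.c2s, val.c2s, test.c2s

data:
  url: null                             # nothing to download, the data is already local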

To start the training you need to run

PYTHONPATH="." python code2seq/code2seq_wrapper.py train -c $PATH_TO_YOUR_CONFIG

Avv22 commented 2 years ago

@SpirinEgor.

Thank you very much.

Lastly: do I need the dict.c2s file that the original code2seq implementation requires? If you remember, I opened an issue in your astminer project where we found that astminer does not extract a dict.c2s file. If your code2class needs dict.c2s, I will use the code2seq extractor since it outputs dict.c2s too, unless training your code2class model does not need it. So in the yaml file above, should I just provide a path to train.c2s, test.c2s, val.c2s without dict.c2s?

Edit: I opened (https://github.com/JetBrains-Research/code2seq/blob/master/config/code2class-poj104.yaml#L1), scrolled down in the file to the data attribute, and downloaded the data you trained your code2class with:


data:
  url: https://s3.eu-west-1.amazonaws.com/datasets.ml.labs.aws.intellij.net/poj-104/poj-104-code2seq.tar.gz
  num_workers: 0

I opened that folder below:

[screenshot of the extracted data folder contents]

I see there is a file vocabulary.pkl. I am not sure what this file is. Do we need it to train code2class? If we do, then neither the code2seq extractor nor the astminer extractor produces this file. Can you please tell us how to get a similar file for our dataset if it is needed?

SpirinEgor commented 2 years ago

You are absolutely right: for our implementation, you don't need dict.c2s. vocabulary.pkl is our version of the dictionary for the model. It will be generated automatically if it is missing and then reused for subsequent launches.

Do not use the pickle from the downloaded data; it contains information for the POJ dataset. Start the training script with your 3 files and the model will collect the vocabulary on its own.

Avv22 commented 2 years ago

@SpirinEgor.

We got the Python JSON dataset (here) from the code2seq repository. Then we ran their extractor to get train.c2s, test.c2s, val.c2s, and we tried to run your model with that data:

$ python code2seq/code2seq_wrapper.py train -c config/code2class-poj104.yaml

but we got the following error:

osboxes@osboxes:~/code2class-master$ python code2seq/code2seq_wrapper.py train -c config/code2class-poj104.yaml

model                 | data                 | train                 | optimizer          
------------------------------------------------------------------------------------------
embedding_size: 128   | url: None            | n_epochs: 10          | optimizer: Momentum
encoder_dropout: 0.25 | num_workers: 0       | patience: 10          | nesterov: True     
encoder_rnn_size: 128 | max_labels: None     | clip_norm: 5          | lr: 0.01           
use_bi_rnn: True      | max_label_parts: 1   | teacher_forcing: 1.0  | weight_decay: 0    
rnn_num_layers: 1     | max_tokens: 190000   | val_every_epoch: 1    | decay_gamma: 0.95  
classifier_layers: 2  | max_token_parts: 5   | save_every_epoch: 1   |                    
classifier_size: 128  | path_length: 9       | log_every_n_steps: 10 |                    
activation: relu      | max_context: 200     |                       |                    
                      | random_context: True |                       |                    
                      | batch_size: 512      |                       |                    
                      | test_batch_size: 768 |                       |                    
Can't find vocabulary, collect it from train holdout
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 167743/167743 [09:57<00:00, 280.68it/s]
Count 19158 label, top-5: [('test', 33921), ('get', 16223), ('init', 12213), ('set', 7051), ('create', 4467)]
Count 35998 token, top-5: [('', 150875091), ('self', 3618002), ('NUM', 3036523), ('none', 1104659), ('name', 490015)]
Count 122 node, top-5: [('NameLoad', 20986589), ('Assign', 14273117), ('Call', 13417335), ('body', 13388558), ('NameStore', 6903018)]
Traceback (most recent call last):
  File "code2seq/code2seq_wrapper.py", line 55, in <module>
    train_code2seq(__config)
  File "code2seq/code2seq_wrapper.py", line 29, in train_code2seq
    data_module = PathContextDataModule(config.data_folder, config.data)
  File "/home/osboxes/miniconda3/envs/scikit-dev/lib/python3.6/site-packages/code2seq/data/path_context_data_module.py", line 28, in __init__
    self._vocabulary = self.setup_vocabulary()
  File "/home/osboxes/miniconda3/envs/scikit-dev/lib/python3.6/site-packages/code2seq/data/path_context_data_module.py", line 49, in setup_vocabulary
    return Vocabulary(vocabulary_path, self._config.labels_count, self._config.tokens_count, self._is_class)
  File "/home/osboxes/miniconda3/envs/scikit-dev/lib/python3.6/site-packages/omegaconf/dictconfig.py", line 354, in __getattr__
    key=key, value=None, cause=e, type_override=ConfigAttributeError
  File "/home/osboxes/miniconda3/envs/scikit-dev/lib/python3.6/site-packages/omegaconf/base.py", line 196, in _format_and_raise
    type_override=type_override,
  File "/home/osboxes/miniconda3/envs/scikit-dev/lib/python3.6/site-packages/omegaconf/_utils.py", line 821, in format_and_raise
    _raise(ex, cause)
  File "/home/osboxes/miniconda3/envs/scikit-dev/lib/python3.6/site-packages/omegaconf/_utils.py", line 719, in _raise
    raise ex.with_traceback(sys.exc_info()[2])  # set end OC_CAUSE=1 for full backtrace
  File "/home/osboxes/miniconda3/envs/scikit-dev/lib/python3.6/site-packages/omegaconf/dictconfig.py", line 351, in __getattr__
    return self._get_impl(key=key, default_value=_DEFAULT_MARKER_)
  File "/home/osboxes/miniconda3/envs/scikit-dev/lib/python3.6/site-packages/omegaconf/dictconfig.py", line 438, in _get_impl
    node = self._get_node(key=key, throw_on_missing_key=True)
  File "/home/osboxes/miniconda3/envs/scikit-dev/lib/python3.6/site-packages/omegaconf/dictconfig.py", line 470, in _get_node
    raise ConfigKeyError(f"Missing key {key}")
omegaconf.errors.ConfigAttributeError: Missing key labels_count
    full_key: data.labels_count
    object_type=dict

This is the code2class-poj104.yaml file; I am not sure whether data.labels_count should be in it or not, but code2class-poj104.yaml is the file from your repository and we did not change it except for the data_folder path:

data_folder: /home/osboxes/code2class-master/data

checkpoint: null

seed: 7
# Training in notebooks (e.g. Google Colab) may crash with too small value
progress_bar_refresh_rate: 1
print_config: true

wandb:
  project: Code2Class -- poj-104
  group: null
  offline: full

data:
  url: 
  num_workers: 0

  max_labels: null
  max_label_parts: 1
  max_tokens: 190000
  max_token_parts: 5
  path_length: 9

  max_context: 200
  random_context: true

  batch_size: 512
  test_batch_size: 768

model:
  # Encoder
  embedding_size: 128
  encoder_dropout: 0.25
  encoder_rnn_size: 128
  use_bi_rnn: true
  rnn_num_layers: 1

  # Classifier
  classifier_layers: 2
  classifier_size: 128
  activation: relu

optimizer:
  optimizer: "Momentum"
  nesterov: true
  lr: 0.01
  weight_decay: 0
  decay_gamma: 0.95

train:
  n_epochs: 10
  patience: 10
  clip_norm: 5
  teacher_forcing: 1.0
  val_every_epoch: 1
  save_every_epoch: 1
  log_every_n_steps: 10

Can you help with this issue please?

SpirinEgor commented 2 years ago

Yeah, this config is a little outdated; I will create an issue to fix it in the near future. In the latest version of the training pipeline, data.max_labels and data.max_tokens were replaced with data.labels_count and data.tokens_count. That is, instead of limiting the vocabulary to an absolute size, we now limit it by the minimum number of occurrences of a given token.

You can open the vocabulary pickle and choose these values based on the counters for each entity.
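For instance, something like this (the internal layout of the pickle is not guaranteed, so inspect whatever object you get and look for the per-entity counters):

import pickle
from collections import Counter

with open("data/vocabulary.pkl", "rb") as f:
    counters = pickle.load(f)
print(type(counters), counters if isinstance(counters, dict) else vars(counters))


def kept_entries(counter: Counter, min_occurrences: int) -> int:
    """How many entries would survive a given *_count threshold."""
    return sum(1 for _, cnt in counter.items() if cnt >= min_occurrences)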

Avv22 commented 2 years ago

@SpirinEgor. Thank you very much, appreciated. So both the training pipeline and the config file have to be changed accordingly. I am not able to open the vocabulary file, as it is not there for our dataset.

SpirinEgor commented 2 years ago

Since it is collected during the first launch, it should be stored in a data folder. Look for vocabulary.pkl.

Avv22 commented 2 years ago

@SpirinEgor. Thank you.

Edit: the dataset we used is from here, so I am not sure if that is the reason for the error below (debugging information included), since I am not sure whether the input data to your model should come from classes or methods, or whether it does not matter.

I see that the vocabulary.pkl collected from our train.c2s has three dictionaries (label, token, node):

These are the item counts of each dictionary (label, token, node) we found in vocabulary.pkl:

  1. Can you please confirm this: we should add data.labels_count and data.tokens_count only to the yaml file (https://github.com/JetBrains-Research/code2seq/blob/master/config/code2class-poj104.yaml#L1) and leave the node vocabulary untouched? So, based on our file, the yaml file should be:

data_folder: path

checkpoint: null

seed: 7
# Training in notebooks (e.g. Google Colab) may crash with too small value
progress_bar_refresh_rate: 1
print_config: true

wandb:
  project: Code2Class -- poj-104
  group: null
  offline: full

data:
  url:
  num_workers: 0
  max_labels: null
  max_label_parts: 1
  tokens_count: 35998
  labels_count: 19158
  max_token_parts: 5
  path_length: 9

  max_context: 200
  random_context: true

  batch_size: 512
  test_batch_size: 768

model:
  # Encoder
  embedding_size: 128
  encoder_dropout: 0.25
  encoder_rnn_size: 128
  use_bi_rnn: true
  rnn_num_layers: 1
  decoder_num_layers: 1
  encoder_num_layers: 1
  decoder_size: 320
  rnn_dropout: 0.5

  # Classifier
  classifier_layers: 2
  classifier_size: 128
  activation: relu

optimizer:
  optimizer: "Momentum"
  nesterov: true
  lr: 0.01
  weight_decay: 0
  decay_gamma: 0.95

train:
  n_epochs: 10
  patience: 10
  clip_norm: 5
  teacher_forcing: 1.0
  val_every_epoch: 1
  save_every_epoch: 1
  log_every_n_steps: 10


I got the following error after I edited the `yaml` file:

(scikit-dev) osboxes@osboxes:~/code2class-master$ python code2seq/code2seq_wrapper.py train -c config/code2class-poj104.yaml

model                 | data                 | train                 | optimizer          
------------------------------------------------------------------------------------------
embedding_size: 128   | url: None            | n_epochs: 10          | optimizer: Momentum
encoder_dropout: 0.25 | num_workers: 0       | patience: 10          | nesterov: True     
encoder_rnn_size: 128 | max_labels: None     | clip_norm: 5          | lr: 0.01           
use_bi_rnn: True      | max_label_parts: 1   | teacher_forcing: 1.0  | weight_decay: 0    
rnn_num_layers: 1     | tokens_count: 35998  | val_every_epoch: 1    | decay_gamma: 0.95  
decoder_num_layers: 1 | labels_count: 19158  | save_every_epoch: 1   |                    
encoder_num_layers: 1 | max_token_parts: 5   | log_every_n_steps: 10 |                    
decoder_size: 320     | path_length: 9       |                       |                    
rnn_dropout: 0.5      | max_context: 200     |                       |                    
classifier_layers: 2  | random_context: True |                       |                    
classifier_size: 128  | batch_size: 512      |                       |                    
activation: relu      | test_batch_size: 768 |                       |                    
Global seed set to 7
wandb: WARNING `resume` will be ignored since W&B syncing is set to `offline`. Starting a new run with run id 185b38om.
/home/osboxes/miniconda3/envs/scikit-dev/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:523: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
/home/osboxes/miniconda3/envs/scikit-dev/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:524: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/home/osboxes/miniconda3/envs/scikit-dev/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:525: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
/home/osboxes/miniconda3/envs/scikit-dev/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:526: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/home/osboxes/miniconda3/envs/scikit-dev/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:527: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
/home/osboxes/miniconda3/envs/scikit-dev/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:532: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  np_resource = np.dtype([("resource", np.ubyte, 1)])
wandb: W&B syncing is set to `offline` in this directory.
wandb: Run `wandb online` or set WANDB_MODE=online to enable cloud syncing.
GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
Dataset is already downloaded
┏━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┓
┃   ┃ Name               ┃ Type                     ┃ Params ┃
┡━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━┩
│ 0 │ _Code2Seq__metrics │ MetricCollection         │      0 │
│ 1 │ _encoder           │ PathEncoder              │  468 K │
│ 2 │ _decoder           │ Decoder                  │  886 K │
│ 3 │ _Code2Seq__loss    │ SequenceCrossEntropyLoss │      0 │
└───┴────────────────────┴──────────────────────────┴────────┘
Trainable params: 1.4 M
Non-trainable params: 0
Total params: 1.4 M
Total estimated model params size (MB): 5
  0%|          | 0/42917 [00:00<?, ?it/s]
 28%|##8       | 12019/42917 [00:00<00:00, 120180.66it/s]
 56%|#####6    | 24038/42917 [00:00<00:00, 109750.27it/s]
 82%|########1 | 35073/42917 [00:00<00:00, 109377.67it/s]
100%|##########| 42917/42917 [00:00<00:00, 108364.28it/s]

/home/osboxes/miniconda3/envs/scikit-dev/lib/python3.6/site-packages/pytorch_lightning/trainer/data_loading.py:112: UserWarning: The dataloader, val_dataloader 0, does not have many workers which may be a
bottleneck. Consider increasing the value of the `num_workers` argument (try 3 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  f"The dataloader, {name}, does not have many workers which may be a bottleneck."
Validation Sanity Check ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0/2 0:00:02 • -:--:-- 0.00it/s
Traceback (most recent call last):
  File "code2seq/code2seq_wrapper.py", line 55, in <module>
    train_code2seq(__config)
  File "code2seq/code2seq_wrapper.py", line 34, in train_code2seq
    train(code2seq, data_module, config)
  File "/home/osboxes/miniconda3/envs/scikit-dev/lib/python3.6/site-packages/code2seq/utils/train.py", line 55, in train
    trainer.fit(model=model, datamodule=data_module)
  File "/home/osboxes/miniconda3/envs/scikit-dev/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 738, in fit
    self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path
  File "/home/osboxes/miniconda3/envs/scikit-dev/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 682, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/home/osboxes/miniconda3/envs/scikit-dev/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 772, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/home/osboxes/miniconda3/envs/scikit-dev/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 1195, in _run
    self._dispatch()
  File "/home/osboxes/miniconda3/envs/scikit-dev/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 1274, in _dispatch
    self.training_type_plugin.start_training(self)
  File "/home/osboxes/miniconda3/envs/scikit-dev/lib/python3.6/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 202, in start_training
    self._results = trainer.run_stage()
  File "/home/osboxes/miniconda3/envs/scikit-dev/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 1284, in run_stage
    return self._run_train()
  File "/home/osboxes/miniconda3/envs/scikit-dev/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 1306, in _run_train
    self._run_sanity_check(self.lightning_module)
  File "/home/osboxes/miniconda3/envs/scikit-dev/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 1370, in _run_sanity_check
    self._evaluation_loop.run()
  File "/home/osboxes/miniconda3/envs/scikit-dev/lib/python3.6/site-packages/pytorch_lightning/loops/base.py", line 145, in run
    self.advance(*args, **kwargs)
  File "/home/osboxes/miniconda3/envs/scikit-dev/lib/python3.6/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 109, in advance
    dl_outputs = self.epoch_loop.run(dataloader, dataloader_idx, dl_max_batches, self.num_dataloaders)
  File "/home/osboxes/miniconda3/envs/scikit-dev/lib/python3.6/site-packages/pytorch_lightning/loops/base.py", line 140, in run
    self.on_run_start(*args, **kwargs)
  File "/home/osboxes/miniconda3/envs/scikit-dev/lib/python3.6/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 86, in on_run_start
    self._dataloader_iter = _update_dataloader_iter(data_fetcher, self.batch_progress.current.ready)
  File "/home/osboxes/miniconda3/envs/scikit-dev/lib/python3.6/site-packages/pytorch_lightning/loops/utilities.py", line 121, in _update_dataloader_iter
    dataloader_iter = enumerate(data_fetcher, batch_idx)
  File "/home/osboxes/miniconda3/envs/scikit-dev/lib/python3.6/site-packages/pytorch_lightning/utilities/fetching.py", line 199, in __iter__
    self.prefetching(self.prefetch_batches)
  File "/home/osboxes/miniconda3/envs/scikit-dev/lib/python3.6/site-packages/pytorch_lightning/utilities/fetching.py", line 258, in prefetching
    self._fetch_next_batch()
  File "/home/osboxes/miniconda3/envs/scikit-dev/lib/python3.6/site-packages/pytorch_lightning/utilities/fetching.py", line 300, in _fetch_next_batch
    batch = next(self.dataloader_iter)
  File "/home/osboxes/miniconda3/envs/scikit-dev/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 521, in __next__
    data = self._next_data()
  File "/home/osboxes/miniconda3/envs/scikit-dev/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 561, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/home/osboxes/miniconda3/envs/scikit-dev/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 49, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/osboxes/miniconda3/envs/scikit-dev/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 49, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/osboxes/miniconda3/envs/scikit-dev/lib/python3.6/site-packages/code2seq/data/path_context_dataset.py", line 50, in __getitem__
    label = self.tokenize_class(raw_label, self._vocab.label_to_id)
  File "/home/osboxes/miniconda3/envs/scikit-dev/lib/python3.6/site-packages/code2seq/data/path_context_dataset.py", line 66, in tokenize_class
    return [vocab[raw_class]]
KeyError: 'extras'

wandb: Waiting for W&B process to finish, PID 6721... (failed 1).
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /home/osboxes/code2class-master/wandb/offline-run-20211201_150200-185b38om
wandb: Find logs at: ./wandb/offline-run-20211201_150200-185b38om/logs/debug.log


So, this is the `debug` file:

2021-12-01 15:02:00,656 INFO MainThread:6713 [wandb_setup.py:_flush():71] setting env: {'mode': 'dryrun'}
2021-12-01 15:02:00,656 INFO MainThread:6713 [wandb_init.py:_log_setup():371] Logging user logs to /home/osboxes/code2class-master/wandb/offline-run-20211201_150200-185b38om/logs/debug.log
2021-12-01 15:02:00,656 INFO MainThread:6713 [wandb_init.py:_log_setup():372] Logging internal logs to /home/osboxes/code2class-master/wandb/offline-run-20211201_150200-185b38om/logs/debug-internal.log
2021-12-01 15:02:00,658 INFO MainThread:6713 [wandb_init.py:init():404] calling init triggers
2021-12-01 15:02:00,658 INFO MainThread:6713 [wandb_init.py:init():411] wandb.init called with sweep_config: {}
config: {'data_folder': '/home/osboxes/code2class-master/data', 'checkpoint': None, 'seed': 7, 'progress_bar_refresh_rate': 1, 'print_config': True, 'wandb': {'project': 'Code2Class -- poj-104', 'group': None, 'offline': 'full'}, 'data': {'url': None, 'num_workers': 0, 'max_labels': None, 'max_label_parts': 1, 'tokens_count': 35998, 'labels_count': 19158, 'max_token_parts': 5, 'path_length': 9, 'max_context': 200, 'random_context': True, 'batch_size': 512, 'test_batch_size': 768}, 'model': {'embedding_size': 128, 'encoder_dropout': 0.25, 'encoder_rnn_size': 128, 'use_bi_rnn': True, 'rnn_num_layers': 1, 'decoder_num_layers': 1, 'encoder_num_layers': 1, 'decoder_size': 320, 'rnn_dropout': 0.5, 'classifier_layers': 2, 'classifier_size': 128, 'activation': 'relu'}, 'optimizer': {'optimizer': 'Momentum', 'nesterov': True, 'lr': 0.01, 'weight_decay': 0, 'decay_gamma': 0.95}, 'train': {'n_epochs': 10, 'patience': 10, 'clip_norm': 5, 'teacher_forcing': 1.0, 'val_every_epoch': 1, 'save_every_epoch': 1, 'log_every_n_steps': 10}}
2021-12-01 15:02:00,659 INFO MainThread:6713 [wandb_init.py:init():449] starting backend
2021-12-01 15:02:00,659 INFO MainThread:6713 [backend.py:_multiprocessing_setup():97] multiprocessing start_methods=fork,spawn,forkserver, using: spawn
2021-12-01 15:02:00,671 INFO MainThread:6713 [backend.py:ensure_launched():199] starting backend process...
2021-12-01 15:02:00,678 INFO MainThread:6713 [backend.py:ensure_launched():205] started backend process with pid: 6721
2021-12-01 15:02:00,680 INFO MainThread:6713 [wandb_init.py:init():458] backend started and connected
2021-12-01 15:02:00,702 INFO MainThread:6713 [wandb_init.py:init():519] updated telemetry
2021-12-01 15:02:00,704 INFO MainThread:6713 [wandb_init.py:init():592] starting run threads in backend
2021-12-01 15:02:05,451 INFO MainThread:6713 [wandb_run.py:_console_start():1816] atexit reg
2021-12-01 15:02:05,453 INFO MainThread:6713 [wandb_run.py:_redirect():1690] redirect: SettingsConsole.REDIRECT
2021-12-01 15:02:05,455 INFO MainThread:6713 [wandb_run.py:_redirect():1695] Redirecting console.
2021-12-01 15:02:05,458 INFO MainThread:6713 [wandb_run.py:_redirect():1751] Redirects installed.
2021-12-01 15:02:05,458 INFO MainThread:6713 [wandb_init.py:init():619] run started, returning control to user process
2021-12-01 15:02:05,632 INFO MainThread:6713 [wandb_run.py:_config_callback():962] config_cb None None {'model_config/embedding_size': 128, 'model_config/encoder_dropout': 0.25, 'model_config/encoder_rnn_size': 128, 'model_config/use_bi_rnn': True, 'model_config/rnn_num_layers': 1, 'model_config/decoder_num_layers': 1, 'model_config/encoder_num_layers': 1, 'model_config/decoder_size': 320, 'model_config/rnn_dropout': 0.5, 'model_config/classifier_layers': 2, 'model_config/classifier_size': 128, 'model_config/activation': 'relu', 'optimizer_config/optimizer': 'Momentum', 'optimizer_config/nesterov': True, 'optimizer_config/lr': 0.01, 'optimizer_config/weight_decay': 0, 'optimizer_config/decay_gamma': 0.95, 'vocabulary': '<code2seq.data.vocabulary.Vocabulary object at 0x7f42981c7278>', 'teacher_forcing': 1.0}
2021-12-01 15:02:07,823 INFO MainThread:6713 [wandb_run.py:_atexit_cleanup():1786] got exitcode: 1
2021-12-01 15:02:07,823 INFO MainThread:6713 [wandb_run.py:_restore():1758] restore
2021-12-01 15:02:10,552 INFO MainThread:6713 [wandb_run.py:_wait_for_finish():1916] got exit ret: 
2021-12-01 15:02:10,664 INFO MainThread:6713 [wandb_run.py:_wait_for_finish():1916] got exit ret: done: true
exit_result {
}
local_info {
}

SpirinEgor commented 2 years ago

It seems that you misunderstood the _count properties. Their value determines the minimal number of occurrences an entry needs in order to be included in the vocabulary. For example, with tokens_count: 10, only tokens that appear at least 10 times in the training data will be included in the vocabulary. You set such high values that the vocabulary is probably almost empty, hence the failing key lookup.
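In other words, the threshold is a filter over the counters, not a vocabulary size (the 'self' and 'name' counts below are taken from your log; 'tmp' is a made-up rare token):

from collections import Counter

token_counter = Counter({"self": 3618002, "name": 490015, "tmp": 3})
tokens_count = 10  # minimal number of occurrences required to keep a token
kept = [tok for tok, cnt in token_counter.items() if cnt >= tokens_count]
print(kept)  # ['self', 'name'] -- 'tmp' (seen 3 times) is dropped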

Another problem that you will face: you extracted data for method name prediction, which is a sequence generation problem, not classification. Thus, during validation or testing, this pipeline will fail on method names that were not in the training data. You should convert your data to the classification format, or change the way the label is extracted for each code snippet (e.g. poj-104 contains 104 classes, so the label in the data is always one of those numbers).
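A rough sketch of such a conversion (this is not part of the library; it assumes each .c2s line starts with the label followed by space-separated path contexts, and class_of_line is a lookup you build from your own dataset, e.g. a folder name or metadata file):

def relabel_c2s(src_path: str, dst_path: str, class_of_line) -> None:
    """Rewrite the first field of every .c2s line with a class label."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        for i, line in enumerate(src):
            _, _, contexts = line.partition(" ")
            dst.write(f"{class_of_line(i)} {contexts}")

# Example usage with a hypothetical list of class ids, one per line of train.c2s:
# relabel_c2s("train.c2s", "train.classes.c2s", lambda i: my_class_ids[i])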

Avv22 commented 2 years ago

@SpirinEgor.

Thank you.

  1. From the vocabulary.pkl file, I see that it counts the occurrences of each entry in label, token, and node. Using the Counter class, I sorted the occurrences in ascending order and found that the minimum occurrence is 1 for both labels and tokens:

[screenshot of the Counter output]

So, in the yaml file I set both data.tokens_count: 10 and data.labels_count: 10, which should drop the tokens and labels with an occurrence of 1. It worked, but it stopped at the following error:

(scikit-dev) osboxes@osboxes:~/code2class-master$ python code2seq/code2seq_wrapper.py train -c config/code2class-poj104.yaml

model                 | data                 | train                 | optimizer          
------------------------------------------------------------------------------------------
embedding_size: 128   | url: None            | n_epochs: 10          | optimizer: Momentum
encoder_dropout: 0.25 | num_workers: 0       | patience: 10          | nesterov: True     
encoder_rnn_size: 128 | max_labels: None     | clip_norm: 5          | lr: 0.01           
use_bi_rnn: True      | max_label_parts: 1   | teacher_forcing: 1.0  | weight_decay: 0    
rnn_num_layers: 1     | tokens_count: 10     | val_every_epoch: 1    | decay_gamma: 0.95  
decoder_num_layers: 1 | labels_count: 10     | save_every_epoch: 1   |                    
encoder_num_layers: 1 | max_token_parts: 5   | log_every_n_steps: 10 |                    
decoder_size: 320     | path_length: 9       |                       |                    
rnn_dropout: 0.5      | max_context: 200     |                       |                    
classifier_layers: 2  | random_context: True |                       |                    
classifier_size: 128  | batch_size: 512      |                       |                    
activation: relu      | test_batch_size: 768 |                       |                    
Global seed set to 7
wandb: WARNING `resume` will be ignored since W&B syncing is set to `offline`. Starting a new run with run id 10xgy3s9.
/home/osboxes/miniconda3/envs/scikit-dev/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:523: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
/home/osboxes/miniconda3/envs/scikit-dev/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:524: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/home/osboxes/miniconda3/envs/scikit-dev/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:525: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
/home/osboxes/miniconda3/envs/scikit-dev/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:526: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/home/osboxes/miniconda3/envs/scikit-dev/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:527: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
/home/osboxes/miniconda3/envs/scikit-dev/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:532: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  np_resource = np.dtype([("resource", np.ubyte, 1)])
wandb: W&B syncing is set to `offline` in this directory.  
wandb: Run `wandb online` or set WANDB_MODE=online to enable cloud syncing.
GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
Dataset is already downloaded
┏━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┓
┃   ┃ Name               ┃ Type                     ┃ Params ┃
┡━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━┩
│ 0 │ _Code2Seq__metrics │ MetricCollection         │      0 │
│ 1 │ _encoder           │ PathEncoder              │  4.4 M │
│ 2 │ _decoder           │ Decoder                  │  2.4 M │
│ 3 │ _Code2Seq__loss    │ SequenceCrossEntropyLoss │      0 │
└───┴────────────────────┴──────────────────────────┴────────┘
Trainable params: 6.8 M                                                                                                                                                                                     
Non-trainable params: 0                                                                                                                                                                                     
Total params: 6.8 M                                                                                                                                                                                         
Total estimated model params size (MB): 27                                                                                                                                                                  
  0%|          | 0/42917 [00:00<?, ?it/s]
 26%|##5       | 11108/42917 [00:00<00:00, 111056.81it/s]
 56%|#####6    | 24164/42917 [00:00<00:00, 122513.75it/s]
 85%|########4 | 36416/42917 [00:00<00:00, 122400.21it/s]
100%|##########| 42917/42917 [00:00<00:00, 125387.08it/s]

/home/osboxes/miniconda3/envs/scikit-dev/lib/python3.6/site-packages/pytorch_lightning/trainer/data_loading.py:112: UserWarning: The dataloader, val_dataloader 0, does not have many workers which may be a
bottleneck. Consider increasing the value of the `num_workers` argument` (try 3 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  f"The dataloader, {name}, does not have many workers which may be a bottleneck."
Validation Sanity Check ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0/2 0:00:00 • -:--:-- 0.00it/s  
Traceback (most recent call last):
  File "code2seq/code2seq_wrapper.py", line 55, in <module>
    train_code2seq(__config)
  File "code2seq/code2seq_wrapper.py", line 34, in train_code2seq
    train(code2seq, data_module, config)
  File "/home/osboxes/miniconda3/envs/scikit-dev/lib/python3.6/site-packages/code2seq/utils/train.py", line 55, in train
    trainer.fit(model=model, datamodule=data_module)
  File "/home/osboxes/miniconda3/envs/scikit-dev/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 738, in fit
    self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path
  File "/home/osboxes/miniconda3/envs/scikit-dev/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 682, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/home/osboxes/miniconda3/envs/scikit-dev/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 772, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/home/osboxes/miniconda3/envs/scikit-dev/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 1195, in _run
    self._dispatch()
  File "/home/osboxes/miniconda3/envs/scikit-dev/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 1274, in _dispatch
    self.training_type_plugin.start_training(self)
  File "/home/osboxes/miniconda3/envs/scikit-dev/lib/python3.6/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 202, in start_training
    self._results = trainer.run_stage()
  File "/home/osboxes/miniconda3/envs/scikit-dev/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 1284, in run_stage
    return self._run_train()
  File "/home/osboxes/miniconda3/envs/scikit-dev/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 1306, in _run_train
    self._run_sanity_check(self.lightning_module)
  File "/home/osboxes/miniconda3/envs/scikit-dev/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 1370, in _run_sanity_check
    self._evaluation_loop.run()
  File "/home/osboxes/miniconda3/envs/scikit-dev/lib/python3.6/site-packages/pytorch_lightning/loops/base.py", line 145, in run
    self.advance(*args, **kwargs)
  File "/home/osboxes/miniconda3/envs/scikit-dev/lib/python3.6/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 109, in advance
    dl_outputs = self.epoch_loop.run(dataloader, dataloader_idx, dl_max_batches, self.num_dataloaders)
  File "/home/osboxes/miniconda3/envs/scikit-dev/lib/python3.6/site-packages/pytorch_lightning/loops/base.py", line 140, in run
    self.on_run_start(*args, **kwargs)
  File "/home/osboxes/miniconda3/envs/scikit-dev/lib/python3.6/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 86, in on_run_start
    self._dataloader_iter = _update_dataloader_iter(data_fetcher, self.batch_progress.current.ready)
  File "/home/osboxes/miniconda3/envs/scikit-dev/lib/python3.6/site-packages/pytorch_lightning/loops/utilities.py", line 121, in _update_dataloader_iter
    dataloader_iter = enumerate(data_fetcher, batch_idx)
  File "/home/osboxes/miniconda3/envs/scikit-dev/lib/python3.6/site-packages/pytorch_lightning/utilities/fetching.py", line 199, in __iter__
    self.prefetching(self.prefetch_batches)
  File "/home/osboxes/miniconda3/envs/scikit-dev/lib/python3.6/site-packages/pytorch_lightning/utilities/fetching.py", line 258, in prefetching
    self._fetch_next_batch()
  File "/home/osboxes/miniconda3/envs/scikit-dev/lib/python3.6/site-packages/pytorch_lightning/utilities/fetching.py", line 300, in _fetch_next_batch
    batch = next(self.dataloader_iter)
  File "/home/osboxes/miniconda3/envs/scikit-dev/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 521, in __next__
    data = self._next_data()
  File "/home/osboxes/miniconda3/envs/scikit-dev/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 561, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/home/osboxes/miniconda3/envs/scikit-dev/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 49, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/osboxes/miniconda3/envs/scikit-dev/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 49, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/osboxes/miniconda3/envs/scikit-dev/lib/python3.6/site-packages/code2seq/data/path_context_dataset.py", line 50, in __getitem__
    label = self.tokenize_class(raw_label, self._vocab.label_to_id)
  File "/home/osboxes/miniconda3/envs/scikit-dev/lib/python3.6/site-packages/code2seq/data/path_context_dataset.py", line 66, in tokenize_class
    return [vocab[raw_class]]
KeyError: 'parse|pem|key'

wandb: Waiting for W&B process to finish, PID 14131... (failed 1).
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /home/osboxes/code2class-master/wandb/offline-run-20211202_034626-10xgy3s9
wandb: Find logs at: ./wandb/offline-run-20211202_034626-10xgy3s9/logs/debug.log

Debug file:

2021-12-02 03:46:26,785 INFO    MainThread:14123 [wandb_setup.py:_flush():71] setting env: {'mode': 'dryrun'}
2021-12-02 03:46:26,786 INFO    MainThread:14123 [wandb_init.py:_log_setup():371] Logging user logs to /home/osboxes/code2class-master/wandb/offline-run-20211202_034626-10xgy3s9/logs/debug.log
2021-12-02 03:46:26,786 INFO    MainThread:14123 [wandb_init.py:_log_setup():372] Logging internal logs to /home/osboxes/code2class-master/wandb/offline-run-20211202_034626-10xgy3s9/logs/debug-internal.log
2021-12-02 03:46:26,788 INFO    MainThread:14123 [wandb_init.py:init():404] calling init triggers
2021-12-02 03:46:26,788 INFO    MainThread:14123 [wandb_init.py:init():411] wandb.init called with sweep_config: {}
config: {'data_folder': '/home/osboxes/code2class-master/data', 'checkpoint': None, 'seed': 7, 'progress_bar_refresh_rate': 1, 'print_config': True, 'wandb': {'project': 'Code2Class -- poj-104', 'group': None, 'offline': 'full'}, 'data': {'url': None, 'num_workers': 0, 'max_labels': None, 'max_label_parts': 1, 'tokens_count': 10, 'labels_count': 10, 'max_token_parts': 5, 'path_length': 9, 'max_context': 200, 'random_context': True, 'batch_size': 512, 'test_batch_size': 768}, 'model': {'embedding_size': 128, 'encoder_dropout': 0.25, 'encoder_rnn_size': 128, 'use_bi_rnn': True, 'rnn_num_layers': 1, 'decoder_num_layers': 1, 'encoder_num_layers': 1, 'decoder_size': 320, 'rnn_dropout': 0.5, 'classifier_layers': 2, 'classifier_size': 128, 'activation': 'relu'}, 'optimizer': {'optimizer': 'Momentum', 'nesterov': True, 'lr': 0.01, 'weight_decay': 0, 'decay_gamma': 0.95}, 'train': {'n_epochs': 10, 'patience': 10, 'clip_norm': 5, 'teacher_forcing': 1.0, 'val_every_epoch': 1, 'save_every_epoch': 1, 'log_every_n_steps': 10}}
2021-12-02 03:46:26,789 INFO    MainThread:14123 [wandb_init.py:init():449] starting backend
2021-12-02 03:46:26,789 INFO    MainThread:14123 [backend.py:_multiprocessing_setup():97] multiprocessing start_methods=fork,spawn,forkserver, using: spawn
2021-12-02 03:46:26,814 INFO    MainThread:14123 [backend.py:ensure_launched():199] starting backend process...
2021-12-02 03:46:26,827 INFO    MainThread:14123 [backend.py:ensure_launched():205] started backend process with pid: 14131
2021-12-02 03:46:26,828 INFO    MainThread:14123 [wandb_init.py:init():458] backend started and connected
2021-12-02 03:46:26,865 INFO    MainThread:14123 [wandb_init.py:init():519] updated telemetry
2021-12-02 03:46:26,872 INFO    MainThread:14123 [wandb_init.py:init():592] starting run threads in backend
2021-12-02 03:46:31,886 INFO    MainThread:14123 [wandb_run.py:_console_start():1816] atexit reg
2021-12-02 03:46:31,890 INFO    MainThread:14123 [wandb_run.py:_redirect():1690] redirect: SettingsConsole.REDIRECT
2021-12-02 03:46:31,893 INFO    MainThread:14123 [wandb_run.py:_redirect():1695] Redirecting console.
2021-12-02 03:46:31,896 INFO    MainThread:14123 [wandb_run.py:_redirect():1751] Redirects installed.
2021-12-02 03:46:31,896 INFO    MainThread:14123 [wandb_init.py:init():619] run started, returning control to user process
2021-12-02 03:46:32,260 INFO    MainThread:14123 [wandb_run.py:_config_callback():962] config_cb None None {'model_config/embedding_size': 128, 'model_config/encoder_dropout': 0.25, 'model_config/encoder_rnn_size': 128, 'model_config/use_bi_rnn': True, 'model_config/rnn_num_layers': 1, 'model_config/decoder_num_layers': 1, 'model_config/encoder_num_layers': 1, 'model_config/decoder_size': 320, 'model_config/rnn_dropout': 0.5, 'model_config/classifier_layers': 2, 'model_config/classifier_size': 128, 'model_config/activation': 'relu', 'optimizer_config/optimizer': 'Momentum', 'optimizer_config/nesterov': True, 'optimizer_config/lr': 0.01, 'optimizer_config/weight_decay': 0, 'optimizer_config/decay_gamma': 0.95, 'vocabulary': '<code2seq.data.vocabulary.Vocabulary object at 0x7f7c48611320>', 'teacher_forcing': 1.0}
2021-12-02 03:46:32,768 INFO    MainThread:14123 [wandb_run.py:_atexit_cleanup():1786] got exitcode: 1
2021-12-02 03:46:32,768 INFO    MainThread:14123 [wandb_run.py:_restore():1758] restore
2021-12-02 03:46:35,376 INFO    MainThread:14123 [wandb_run.py:_wait_for_finish():1916] got exit ret: 
2021-12-02 03:46:35,481 INFO    MainThread:14123 [wandb_run.py:_wait_for_finish():1916] got exit ret: 
2021-12-02 03:46:35,586 INFO    MainThread:14123 [wandb_run.py:_wait_for_finish():1916] got exit ret: done: true
exit_result {
}
local_info {
}
  2. For the dataset, I see the POJ format starts with a number (the class number, probably) followed by the tokens for each class in all of train.c2s, test.c2s, val.c2s, as opposed to the dataset we used from the code2seq repository here, which has only tokens without numbers, as you said @SpirinEgor:

[screenshot of the POJ-104 .c2s data format]

  3. So it seems the dataset from the code2seq repository here (Python parsed ASTs in JSON format) is built only from method names. Are there datasets for Python and Java in your format? If yes, I assume you used the Python and Java code2seq extractors to get the train.c2s, test.c2s, val.c2s files to train your model, or did you use another extractor? If not, we could use the POJ dataset for Java classes that you used to train your model, but if you don't have a dataset in your format for Python, do you have any suggestion on how to get one for Python similar to the POJ dataset you have for Java?

  4. For preparing data from scratch, I don't know what the format of POJ was before running the extractor; i.e., in the code2seq dataset here, they have one JSON object per line for each file. In your case, should each file have only one class, and then we run the AST extractor, store the result in a row for each (Java/Python) file that has one class, and choose the Python or Java extractor based on the language? I would then have to add the counts of each node for each file, etc., which is like building an extractor from scratch. Can we use the one you used to prepare the POJ dataset for Python, or does it only work for Java, if you have one?

SpirinEgor commented 2 years ago

You are struggling with the same issue again: since you set max_label_parts: 1, the model treats your data as a classification setup and therefore expects all classes to be in the vocabulary. But since you restrict the vocabulary to labels with at least 10 occurrences, it's logical to see this error for method names that are not in the vocabulary.

2-4: I really don't understand your intentions for model training. If your target is predicting method names, then you should choose the code2seq model with the corresponding config. But if you want to train classifiers or embeddings, then you should prepare your data accordingly, not by extracting method names.

Avv22 commented 2 years ago

@SpirinEgor.

Thank you. We are looking only for context vectors, for both Python and Java. We ran code2seq successfully, but we don't want method names; we want context vector embeddings that are representative of the whole file.

Context vectors for Java: I assume your POJ dataset is for Java. We ran your model on the POJ dataset with your configuration and it does not train. This is your code2class yaml configuration file as you wrote it; we did not change it:

data_folder: 

checkpoint: null

seed: 7
# Training in notebooks (e.g. Google Colab) may crash with too small value
progress_bar_refresh_rate: 1
print_config: true

wandb:
  project: Code2Class -- poj-104
  group: null
  offline: full

data:
  url: https://s3.eu-west-1.amazonaws.com/datasets.ml.labs.aws.intellij.net/poj-104/poj-104-code2seq.tar.gz
  num_workers: 0
  max_labels: null
  max_label_parts: 1
  tokens_count: 190000
  labels_count: 19158
  max_token_parts: 5
  path_length: 9

  max_context: 200
  random_context: true

  batch_size: 512
  test_batch_size: 768

model:
  # Encoder
  embedding_size: 128
  encoder_dropout: 0.25
  encoder_rnn_size: 128
  use_bi_rnn: true
  rnn_num_layers: 1
  decoder_num_layers: 1
  encoder_num_layers: 1
  decoder_size: 320
  rnn_dropout: 0.5

  # Classifier
  classifier_layers: 2
  classifier_size: 128
  activation: relu

optimizer:
  optimizer: "Momentum"
  nesterov: true
  lr: 0.01
  weight_decay: 0
  decay_gamma: 0.95

train:
  n_epochs: 10
  patience: 10
  clip_norm: 5
  teacher_forcing: 1.0
  val_every_epoch: 1
  save_every_epoch: 1
  log_every_n_steps: 10

This is the error we got, given that we did not change your code2class yaml configuration file:

(scikit-dev) osboxes@osboxes:~/code2class-master$ python code2seq/code2seq_wrapper.py train -c config/code2class-poj104.yaml

model                 | data                                                                                                      | train                 | optimizer          
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
embedding_size: 128   | url: https://s3.eu-west-1.amazonaws.com/datasets.ml.labs.aws.intellij.net/poj-104/poj-104-code2seq.tar.gz | n_epochs: 10          | optimizer: Momentum
encoder_dropout: 0.25 | num_workers: 0                                                                                            | patience: 10          | nesterov: True     
encoder_rnn_size: 128 | max_labels: None                                                                                          | clip_norm: 5          | lr: 0.01           
use_bi_rnn: True      | max_label_parts: 1                                                                                        | teacher_forcing: 1.0  | weight_decay: 0    
rnn_num_layers: 1     | tokens_count: 190000                                                                                      | val_every_epoch: 1    | decay_gamma: 0.95  
decoder_num_layers: 1 | labels_count: 19158                                                                                       | save_every_epoch: 1   |                    
encoder_num_layers: 1 | max_token_parts: 5                                                                                        | log_every_n_steps: 10 |                    
decoder_size: 320     | path_length: 9                                                                                            |                       |                    
rnn_dropout: 0.5      | max_context: 200                                                                                          |                       |                    
classifier_layers: 2  | random_context: True                                                                                      |                       |                    
classifier_size: 128  | batch_size: 512                                                                                           |                       |                    
activation: relu      | test_batch_size: 768                                                                                      |                       |                    
Global seed set to 7
wandb: WARNING `resume` will be ignored since W&B syncing is set to `offline`. Starting a new run with run id 26gqq6iy.
/home/osboxes/miniconda3/envs/scikit-dev/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:523: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
/home/osboxes/miniconda3/envs/scikit-dev/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:524: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/home/osboxes/miniconda3/envs/scikit-dev/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:525: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
/home/osboxes/miniconda3/envs/scikit-dev/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:526: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/home/osboxes/miniconda3/envs/scikit-dev/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:527: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
/home/osboxes/miniconda3/envs/scikit-dev/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:532: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  np_resource = np.dtype([("resource", np.ubyte, 1)])
wandb: W&B syncing is set to `offline` in this directory.  
wandb: Run `wandb online` or set WANDB_MODE=online to enable cloud syncing.
GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
Dataset is already downloaded
┏━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┓
┃   ┃ Name               ┃ Type                     ┃ Params ┃
┡━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━┩
│ 0 │ _Code2Seq__metrics │ MetricCollection         │      0 │
│ 1 │ _encoder           │ PathEncoder              │  440 K │
│ 2 │ _decoder           │ Decoder                  │  885 K │
│ 3 │ _Code2Seq__loss    │ SequenceCrossEntropyLoss │      0 │
└───┴────────────────────┴──────────────────────────┴────────┘
Trainable params: 1.3 M                                                                                                                                                                                     
Non-trainable params: 0                                                                                                                                                                                     
Total params: 1.3 M                                                                                                                                                                                         
Total estimated model params size (MB): 5                                                                                                                                                                   
  0%|          | 0/8072 [00:00<?, ?it/s]
 28%|##8       | 2262/8072 [00:00<00:00, 22611.45it/s]
 56%|#####6    | 4524/8072 [00:00<00:00, 20550.21it/s]
 82%|########1 | 6592/8072 [00:00<00:00, 19220.33it/s]
100%|##########| 8072/8072 [00:00<00:00, 19658.26it/s]

/home/osboxes/miniconda3/envs/scikit-dev/lib/python3.6/site-packages/pytorch_lightning/trainer/data_loading.py:112: UserWarning: The dataloader, val_dataloader 0, does not have many workers which may be a
bottleneck. Consider increasing the value of the `num_workers` argument` (try 3 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  f"The dataloader, {name}, does not have many workers which may be a bottleneck."
Validation Sanity Check ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0/2 0:00:03 • -:--:-- 0.00it/s  
Traceback (most recent call last):
  File "code2seq/code2seq_wrapper.py", line 55, in <module>
    train_code2seq(__config)
  File "code2seq/code2seq_wrapper.py", line 34, in train_code2seq
    train(code2seq, data_module, config)
  File "/home/osboxes/miniconda3/envs/scikit-dev/lib/python3.6/site-packages/code2seq/utils/train.py", line 55, in train
    trainer.fit(model=model, datamodule=data_module)
  File "/home/osboxes/miniconda3/envs/scikit-dev/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 738, in fit
    self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path
  File "/home/osboxes/miniconda3/envs/scikit-dev/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 682, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/home/osboxes/miniconda3/envs/scikit-dev/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 772, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/home/osboxes/miniconda3/envs/scikit-dev/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 1195, in _run
    self._dispatch()
  File "/home/osboxes/miniconda3/envs/scikit-dev/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 1274, in _dispatch
    self.training_type_plugin.start_training(self)
  File "/home/osboxes/miniconda3/envs/scikit-dev/lib/python3.6/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 202, in start_training
    self._results = trainer.run_stage()
  File "/home/osboxes/miniconda3/envs/scikit-dev/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 1284, in run_stage
    return self._run_train()
  File "/home/osboxes/miniconda3/envs/scikit-dev/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 1306, in _run_train
    self._run_sanity_check(self.lightning_module)
  File "/home/osboxes/miniconda3/envs/scikit-dev/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 1370, in _run_sanity_check
    self._evaluation_loop.run()
  File "/home/osboxes/miniconda3/envs/scikit-dev/lib/python3.6/site-packages/pytorch_lightning/loops/base.py", line 145, in run
    self.advance(*args, **kwargs)
  File "/home/osboxes/miniconda3/envs/scikit-dev/lib/python3.6/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 109, in advance
    dl_outputs = self.epoch_loop.run(dataloader, dataloader_idx, dl_max_batches, self.num_dataloaders)
  File "/home/osboxes/miniconda3/envs/scikit-dev/lib/python3.6/site-packages/pytorch_lightning/loops/base.py", line 140, in run
    self.on_run_start(*args, **kwargs)
  File "/home/osboxes/miniconda3/envs/scikit-dev/lib/python3.6/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 86, in on_run_start
    self._dataloader_iter = _update_dataloader_iter(data_fetcher, self.batch_progress.current.ready)
  File "/home/osboxes/miniconda3/envs/scikit-dev/lib/python3.6/site-packages/pytorch_lightning/loops/utilities.py", line 121, in _update_dataloader_iter
    dataloader_iter = enumerate(data_fetcher, batch_idx)
  File "/home/osboxes/miniconda3/envs/scikit-dev/lib/python3.6/site-packages/pytorch_lightning/utilities/fetching.py", line 199, in __iter__
    self.prefetching(self.prefetch_batches)
  File "/home/osboxes/miniconda3/envs/scikit-dev/lib/python3.6/site-packages/pytorch_lightning/utilities/fetching.py", line 258, in prefetching
    self._fetch_next_batch()
  File "/home/osboxes/miniconda3/envs/scikit-dev/lib/python3.6/site-packages/pytorch_lightning/utilities/fetching.py", line 300, in _fetch_next_batch
    batch = next(self.dataloader_iter)
  File "/home/osboxes/miniconda3/envs/scikit-dev/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 521, in __next__
    data = self._next_data()
  File "/home/osboxes/miniconda3/envs/scikit-dev/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 561, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/home/osboxes/miniconda3/envs/scikit-dev/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 49, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/osboxes/miniconda3/envs/scikit-dev/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 49, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/osboxes/miniconda3/envs/scikit-dev/lib/python3.6/site-packages/code2seq/data/path_context_dataset.py", line 50, in __getitem__
    label = self.tokenize_class(raw_label, self._vocab.label_to_id)
  File "/home/osboxes/miniconda3/envs/scikit-dev/lib/python3.6/site-packages/code2seq/data/path_context_dataset.py", line 66, in tokenize_class
    return [vocab[raw_class]]
KeyError: '12'

wandb: Waiting for W&B process to finish, PID 3204... (failed 1).
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /home/osboxes/code2class-master/wandb/offline-run-20211203_182746-26gqq6iy
wandb: Find logs at: ./wandb/offline-run-20211203_182746-26gqq6iy/logs/debug.log
wandb: 

If we lower the counts:

data:
  url: https://s3.eu-west-1.amazonaws.com/datasets.ml.labs.aws.intellij.net/poj-104/poj-104-code2seq.tar.gz
  num_workers: 0
  max_labels: null
  max_label_parts: 1
  tokens_count: 10
  labels_count: 10
  max_token_parts: 5
  path_length: 9

  max_context: 200
  random_context: true

  batch_size: 512
  test_batch_size: 768

Error we got:

(scikit-dev) osboxes@osboxes:~/code2class-master$ python code2seq/code2seq_wrapper.py train -c config/code2class-poj104.yaml

model                 | data                                                                                                      | train                 | optimizer          
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
embedding_size: 128   | url: https://s3.eu-west-1.amazonaws.com/datasets.ml.labs.aws.intellij.net/poj-104/poj-104-code2seq.tar.gz | n_epochs: 10          | optimizer: Momentum
encoder_dropout: 0.25 | num_workers: 0                                                                                            | patience: 10          | nesterov: True     
encoder_rnn_size: 128 | max_labels: None                                                                                          | clip_norm: 5          | lr: 0.01           
use_bi_rnn: True      | max_label_parts: 1                                                                                        | teacher_forcing: 1.0  | weight_decay: 0    
rnn_num_layers: 1     | tokens_count: 10                                                                                          | val_every_epoch: 1    | decay_gamma: 0.95  
decoder_num_layers: 1 | labels_count: 10                                                                                          | save_every_epoch: 1   |                    
encoder_num_layers: 1 | max_token_parts: 5                                                                                        | log_every_n_steps: 10 |                    
decoder_size: 320     | path_length: 9                                                                                            |                       |                    
rnn_dropout: 0.5      | max_context: 200                                                                                          |                       |                    
classifier_layers: 2  | random_context: True                                                                                      |                       |                    
classifier_size: 128  | batch_size: 512                                                                                           |                       |                    
activation: relu      | test_batch_size: 768                                                                                      |                       |                    
Traceback (most recent call last):
  File "code2seq/code2seq_wrapper.py", line 55, in <module>
    train_code2seq(__config)
  File "code2seq/code2seq_wrapper.py", line 29, in train_code2seq
    data_module = PathContextDataModule(config.data_folder, config.data)
  File "/home/osboxes/miniconda3/envs/scikit-dev/lib/python3.6/site-packages/code2seq/data/path_context_data_module.py", line 28, in __init__
    self._vocabulary = self.setup_vocabulary()
  File "/home/osboxes/miniconda3/envs/scikit-dev/lib/python3.6/site-packages/code2seq/data/path_context_data_module.py", line 49, in setup_vocabulary
    return Vocabulary(vocabulary_path, self._config.labels_count, self._config.tokens_count, self._is_class)
  File "/home/osboxes/miniconda3/envs/scikit-dev/lib/python3.6/site-packages/code2seq/data/vocabulary.py", line 18, in __init__
    super().__init__(vocabulary_file, labels_count, tokens_count)
  File "/home/osboxes/miniconda3/envs/scikit-dev/lib/python3.6/site-packages/commode_utils/vocabulary.py", line 33, in __init__
    labels = self._extract_tokens_by_count(self._counters[self.LABEL], labels_count)
  File "/home/osboxes/miniconda3/envs/scikit-dev/lib/python3.6/site-packages/commode_utils/vocabulary.py", line 65, in _extract_tokens_by_count
    border = [i for i, c in enumerate(counts) if c < count_border][0]
IndexError: list index out of range

This is from your POJ dataset, showing the least common token and label counts, so how should we set up your code2class.yaml file to run your POJ data, please? image

I tried to set the counts based on the minimum values above, but it still does not work:


data:
  url: https://s3.eu-west-1.amazonaws.com/datasets.ml.labs.aws.intellij.net/poj-104/poj-104-code2seq.tar.gz
  num_workers: 0
  max_labels: null
  max_label_parts: 1
  tokens_count: 300
  labels_count: 1
  max_token_parts: 5
  path_length: 9

  max_context: 200
  random_context: true

  batch_size: 512
  test_batch_size: 768

Error:

  File "/home/osboxes/miniconda3/envs/scikit-dev/lib/python3.6/site-packages/commode_utils/vocabulary.py", line 65, in _extract_tokens_by_count
    border = [i for i, c in enumerate(counts) if c < count_border][0]
IndexError: list index out of range
Avv22 commented 2 years ago

@SpirinEgor. Another thing, Egor: I used your astminer tool to extract code2seq path contexts to train your code2class model. Below is a sample of the input data for the code2class model:

image

Then, in the code2class.yaml file, I set the configuration as follows:

data_folder: dataset

checkpoint: null

seed: 7
# Training in notebooks (e.g. Google Colab) may crash with too small value
progress_bar_refresh_rate: 1
print_config: true

wandb:
  project: Code2Class -- poj-104
  group: null
  offline: full

data:
  url: 
  num_workers: 5
  max_labels: null
  max_label_parts: null
  tokens_count: 190000
  labels_count: 19158
  max_token_parts: 5
  path_length: 9

  max_context: 200
  random_context: true

  batch_size: 512
  test_batch_size: 768

model:
  # Encoder
  embedding_size: 128
  encoder_dropout: 0.25
  encoder_rnn_size: 128
  use_bi_rnn: true
  rnn_num_layers: 1
  decoder_num_layers: 1
  encoder_num_layers: 1
  decoder_size: 320
  rnn_dropout: 0.5

  # Classifier
  classifier_layers: 2
  classifier_size: 128
  activation: relu

optimizer:
  optimizer: "Momentum"
  nesterov: true
  lr: 0.01
  weight_decay: 0
  decay_gamma: 0.95

train:
  n_epochs: 10
  patience: 10
  clip_norm: 5
  teacher_forcing: 1.0
  val_every_epoch: 1
  save_every_epoch: 1
  log_every_n_steps: 10

Output: the model is running for 10 epochs:

image

Question: what will the model do in this case once training finishes, given the data passed above and a configuration where, as you can see, we set max_labels=null and max_label_parts=null? Can this model be used to predict a vector for each input file, please? Again, we are interested in neither code classification nor method name prediction.

SpirinEgor commented 2 years ago

The first thing to note is that you collected data for the code2vec model. As you can see, each path is represented by 3 numbers: the id of the start token, the id of the path, and the id of the end token. Our code2seq and code2class implementations, however, work with the following input format: start token, ids of the nodes along the path, end token. Your input is still valid, but for the model it means that all tokens consist of one subtoken and all paths contain only one intermediate node. I suggest you mine the data again with the code2seq format set up in the config; see the illustration below.
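
To make the difference concrete, here is a purely illustrative comparison of the two layouts; the literals and separators below are invented for the sketch and depend on the astminer version and its config, so do not treat them as the exact on-disk format:

# Illustrative only: both literals are made up to show the structural difference,
# not the exact serialization astminer produces.

# code2vec-style context: three ids -- start token, the whole path collapsed
# into a single path id, end token.
code2vec_context = "1053,2942,817"

# code2seq-style context: start token, the sequence of AST node ids along the
# path, end token -- the path stays a sequence, which is what the code2seq and
# code2class encoders in this repository expect.
code2seq_context = "1053,17 4 9 4 21,817"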

max_labels is a deprecated config property, and max_label_parts is unused for the code2class model; it is only needed to determine whether training is on sequential data or not.

After the model finishes training, it will be stored in the checkpoint folder; see the wandb/<run-id>/files folder for the checkpoints saved after each epoch. You can load a checkpoint by running Code2Class.load_from_checkpoint(<path>) and then use the forward method to get predictions on new data.
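
A minimal sketch of that workflow, assuming this repository's package layout, a placeholder checkpoint name, and the POJ-104 config from above; the forward call at the end is schematic, so mirror whatever the validation/test step of the installed Code2Class version actually does:

import torch
from omegaconf import OmegaConf

# Import paths assumed from this repository's layout; adjust if they differ.
from code2seq.model import Code2Class
from code2seq.data.path_context_data_module import PathContextDataModule

# Placeholder checkpoint name -- pick a real file from wandb/<run-id>/files/.
ckpt_path = "wandb/offline-run-20211203_182746-26gqq6iy/files/epoch=9.ckpt"
config = OmegaConf.load("config/code2class-poj104.yaml")

model = Code2Class.load_from_checkpoint(ckpt_path)  # standard PyTorch Lightning API
model.eval()

data_module = PathContextDataModule(config.data_folder, config.data)
data_module.setup("test")  # may not be required in every version
batch = next(iter(data_module.test_dataloader()))

with torch.no_grad():
    # Schematic call: pass the batch fields to forward exactly as the test step does.
    logits = model(batch)  # expected shape: [batch size; num classes]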

Avv22 commented 2 years ago

Thank you for your response. Lastly, what would the prediction of the code2class model be, based on the input and configuration provided above, please?

SpirinEgor commented 2 years ago

It would be a tensor of size [batch size; num classes]. The number of classes is the size of the label vocabulary.
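
For completeness, a tiny sketch of how such a [batch size; num classes] tensor is usually consumed; the sizes and values below are dummies:

import torch

# Dummy stand-in for the model output: [batch size; num classes] (sizes assumed).
logits = torch.randn(4, 104)

probabilities = torch.softmax(logits, dim=-1)   # per-class probabilities
predicted_label = probabilities.argmax(dim=-1)  # one predicted label id per sample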

Avv22 commented 2 years ago

Since I have no classes in the dataset, as you can see in the previous posts, and no labels, just code2vec path contexts, what would the tensor represent, please? Can this tensor be helpful for various downstream tasks, whether supervised or unsupervised?

Avv22 commented 2 years ago

It would be a tensor of size [batch size; num classes]. The number of classes is the size of the label vocabulary.

@SpirinEgor. Training on the code2vec input stopped after 1 epoch, as you said, because the dataset is not labelled. It does not seem that either your astminer or your code2seq extractor is able to produce labels for each path context, so I have to do this myself. The goal is neither classification nor method name prediction, but a vector embedding for the whole file with all the units it contains (methods, classes, etc.). So probably your models/extractors are not suitable in my case?

val
---------------
loss = nan
f1 = 0.0
precision = 0.0
recall = 0.0
chrf = 0.0
Epoch 0    ---------------------------------------  843/844 18:43:41 • 0:00:04 0.30it/s loss: nan v_num: 0wmo f1: 0.0

train           | val
---------------------------------
loss = nan      | loss = nan
f1 = 0.0        | f1 = 0.0
precision = 0.0 | precision = 0.0
recall = 0.0    | recall = 0.0
                | chrf = 0.0
Epoch 0    ---------------------------------------- 844/844 18:43:41 • 0:00:00 0.32it/s loss: nan v_num: 0wmo f1: 0.0
SpirinEgor commented 2 years ago

You can create a new label extractor in astminer. In it, you can specify how each file is labeled; see the documentation for more details. For example, you can use the filename or the parent folder name as a label.
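
If you want to control the labels yourself, one simple option is to arrange the files into one folder per label before mining, so that a folder-name-based label extractor picks the label up. A minimal sketch, assuming a hypothetical labels.csv that maps file names to labels:

import csv
import shutil
from pathlib import Path

# Hypothetical inputs: a flat directory of source files and a CSV mapping
# each file name to the label you want the model to learn.
src_dir = Path("raw_files")
out_dir = Path("labeled_dataset")
labels_csv = Path("labels.csv")  # columns: file_name,label

with labels_csv.open() as f:
    for row in csv.DictReader(f):
        target_dir = out_dir / row["label"]
        target_dir.mkdir(parents=True, exist_ok=True)
        # Copying the file under its label folder makes the folder name the label.
        shutil.copy2(src_dir / row["file_name"], target_dir / row["file_name"])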

SpirinEgor commented 2 years ago

I close it due to inactivity but feel free to reopen it in case of any questions.

Avv22 commented 2 years ago

@SpirinEgor.

Hello Egor,

I was interested to see whether any of your models can extract control flow from source code for Java and Python, please. I am not interested in ASTs anymore. Have you published any such model before, please?