Fix the labels

yangtcai commented 2 years ago

only

[x] fix dataloader link: #22 fix labels

yangtcai commented 2 years ago

Hi, @williamstark01, I have fixed the problem with the labels, but there are still many segments without repeat sequences. Should I change the generation stage?

williamstark01 commented 2 years ago

Hey Yantong, the algorithm for selecting the repeats between segments looks solid.

I'm getting an exception though when I run dataloader:

Traceback (most recent call last):
  File "/data/EMBL-EBI/GSoC-Repeat-Identification/Ensembl-Repeat-Identification/dataloader.py", line 229, in <module>
    item = dataset[index]
  File "/data/EMBL-EBI/GSoC-Repeat-Identification/Ensembl-Repeat-Identification/dataloader.py", line 92, in __getitem__
    return self.forward_strand(index)
  File "/data/EMBL-EBI/GSoC-Repeat-Identification/Ensembl-Repeat-Identification/dataloader.py", line 87, in forward_strand
    sample, coordinates = self.transform((sample, coordinates))
  File "/data/.pyenv/versions/3.9.12/envs/repeat_identification/lib/python3.9/site-packages/torchvision/transforms/transforms.py", line 95, in __call__
    img = t(img)
  File "/data/EMBL-EBI/GSoC-Repeat-Identification/Ensembl-Repeat-Identification/dataloader.py", line 138, in __call__
    target_tensor = torch.tensor(target_array, dtype=torch.float32)
TypeError: can't convert np.ndarray of type numpy.object_. The only supported types are: float64, float32, float16, complex64, complex128, int64, int32, int16, int8, uint8, and bool.

Could you take a look?

Also, remember to format the code with Black, you forgot this time :) (I think you can get a plugin to do this automatically in VS Code).
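
For reference, the TypeError in the traceback above is what torch.tensor() raises when it is handed a NumPy array whose dtype is object. A minimal sketch of one way to avoid it, assuming the targets are plain numbers that were merely stored in an object array (the `to_float32` helper and the label layout are hypothetical, not taken from dataloader.py):

```python
import numpy as np


def to_float32(target):
    """Return `target` as a float32 array so torch.tensor() can accept it."""
    target_array = np.asarray(target)
    # An object-dtype array holds generic Python objects; astype() converts
    # them to numbers, or raises if a row is ragged or non-numeric.
    return target_array.astype(np.float32)


# Hypothetical labels: (start, end, class) per repeat, stored with dtype=object
labels = np.array([[10, 25, 1], [40, 60, 0]], dtype=object)
fixed = to_float32(labels)
print(labels.dtype, "->", fixed.dtype)  # object -> float32
```

If the conversion itself raises, the rows probably have different lengths or contain non-numeric entries, and that has to be fixed where the labels are built.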

williamstark01 commented 2 years ago

> Hi, @williamstark01, I have fixed the problem with the labels, but there are still many segments without repeat sequences. Should I change the generation stage?

What do you mean? Are there a lot of segments that don't contain repeats?

yangtcai commented 2 years ago

Hi, @williamstark01, I forgot to change the dtype. I have fixed the errors now, PTAL. Currently, only 8769 segments contain a repeat sequence. Compared with the total of 165970 segments, that is a small fraction. I think there is a simple way to solve this: we can remove the segments without repeats in the training stage. This solution is very easy, but not fast. The alternative is to change the dataset generation stage: when we generate the datasets, we can drop the redundant segments without repeat sequences. I think this may be faster than the first solution, because the dataset would then contain only the segments with repeat sequences. What do you think?
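
For scale, 8769 of 165970 segments is about 5.3%. The first option (skipping empty segments at training time) could look roughly like the sketch below. It assumes each item is a `(sample, coordinates)` pair, as the traceback earlier in the thread suggests, and that an empty `coordinates` list means "no repeat"; the `RepeatOnlyView` name is hypothetical.

```python
class RepeatOnlyView:
    """Expose only the items of `dataset` for which `has_repeat` is true."""

    def __init__(self, dataset, has_repeat):
        self.dataset = dataset
        # Scan once up front; this is the slow part. Later lookups are O(1).
        self.kept = [i for i in range(len(dataset)) if has_repeat(dataset[i])]

    def __len__(self):
        return len(self.kept)

    def __getitem__(self, index):
        # Map the filtered index back to the index in the full dataset
        return self.dataset[self.kept[index]]


# Toy stand-in for the real dataset: (sample, coordinates) pairs
segments = [("ACGT", []), ("TTGA", [(0, 3)]), ("GGCC", [])]
view = RepeatOnlyView(segments, has_repeat=lambda item: len(item[1]) > 0)
print(len(view), view[0])  # 1 ('TTGA', [(0, 3)])
```

The full dataset on disk stays untouched, so a different filter can be tried later without regenerating anything.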

williamstark01 commented 2 years ago

Looks good now!

Great to see you thought about the problem in detail. In general, it's not ideal to hardcode assumptions in the dataset generation stage, as it reduces the flexibility to test different ideas later on. Filtering during training instead increases the complexity of item generation, making it slower, as you say. Using workers for item batch generation solves this problem, though, so we get the best of both options, making training fast again. (Workers generate the next batch in a separate process, so the main process running training isn't affected at all.)
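
A minimal sketch of the worker idea with PyTorch's DataLoader. `SegmentDataset` and `make_loader` are toy, hypothetical stand-ins for the project's dataset class and setup code; only `DataLoader` and its `num_workers` argument are the real PyTorch API.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class SegmentDataset(Dataset):
    """Toy stand-in: each item is built on demand when it is requested."""

    def __init__(self, num_segments, segment_length=8):
        self.num_segments = num_segments
        self.segment_length = segment_length

    def __len__(self):
        return self.num_segments

    def __getitem__(self, index):
        # Placeholder for the per-item work (slicing, filtering, encoding)
        return torch.full((self.segment_length,), float(index))


def make_loader(dataset, batch_size=4, num_workers=2):
    # num_workers > 0 builds batches in background worker processes;
    # num_workers=0 builds them in the main process instead.
    return DataLoader(
        dataset, batch_size=batch_size, shuffle=False, num_workers=num_workers
    )
```

Iterating over `make_loader(SegmentDataset(10))` yields batches of shape (4, 8), (4, 8) and (2, 8). On platforms that start worker processes with spawn, the loop over the loader should sit under an `if __name__ == "__main__":` guard.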

We can go even beyond that and split the repeat annotations file hg38.hits into one file per chromosome, but do no other filtering in the dataset generation stage. Then, at the start of training, use pandas to filter for LTR or other repeat types when the Dataset object is created, and select the segments with the appropriate repeats during training. Again, with worker processes, training will still be fast.
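
A rough pandas sketch of that split-then-filter idea. The column names (`chromosome`, `start`, `end`, `repeat_type`) and the two helper functions are assumptions for illustration; the actual layout of hg38.hits is not shown in this thread.

```python
import pandas as pd


def split_by_chromosome(annotations):
    """Return {chromosome: DataFrame}, keeping every row (no filtering)."""
    return {
        chromosome: frame.reset_index(drop=True)
        for chromosome, frame in annotations.groupby("chromosome")
    }


def select_repeat_type(annotations, repeat_type):
    """Keep only rows of one repeat type, e.g. when the Dataset is created."""
    return annotations[annotations["repeat_type"] == repeat_type]


# Toy annotations with hypothetical columns
annotations = pd.DataFrame(
    {
        "chromosome": ["chr1", "chr1", "chr2"],
        "start": [100, 900, 50],
        "end": [400, 1200, 300],
        "repeat_type": ["LTR", "LINE", "LTR"],
    }
)

per_chromosome = split_by_chromosome(annotations)
ltr_chr1 = select_repeat_type(per_chromosome["chr1"], "LTR")
print(sorted(per_chromosome), len(ltr_chr1))  # ['chr1', 'chr2'] 1
```

Each per-chromosome frame could be written to its own file once, in the generation stage, and the repeat-type filter applied later when training starts.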

Does that make sense?

That said, if you strongly prefer to handle this in the dataset generation stage, go for it. For the first prototype, this is just a small implementation detail, not too important how we handle it.

yangtcai commented 2 years ago

GOT IT!!!! I will follow your advice and complete the training stage :D.

EnsemblGSOC / Ensembl-Repeat-Identification

Fix the labels #23