dobkeratops / convnet


What about trying to dropout from the actual image channels as well? Also applying image augmentations. "#never dropout the actual input image channels!" #2

Open Twenkid opened 2 years ago

Twenkid commented 2 years ago

https://github.com/dobkeratops/convnet/blob/185780c299a90d4514e05dada6efb3e2a58dc0c9/pytorch/autoencoder.py#L185

Should this be a strict requirement? Dropout in the input image is used for image augmentation: https://brunokrinski.github.io/awesome-data-augmentation/

[image: cutout augmentation example] (Here they call it "cutout".)

I think if there is dropout in the input, while the output is preserved, that may force the model to "imagine" something to fill in the gap and/or to find more distant internal correlations.

Image augmentation in general is supposed to improve the model's generalisation.

Edit: The noise that is added to the input is a kind of augmentation, but there could be more.
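
For illustration, a minimal sketch of this kind of input "dropout" (cutout/random erasing) as an augmentation, assuming PyTorch/torchvision; the parameters are arbitrary, not tuned:

import torch
from torchvision import transforms

# Cutout-style augmentation: erase a random rectangle from the *input* only,
# while the reconstruction target stays the clean image.
cutout = transforms.RandomErasing(p=0.9, scale=(0.02, 0.2), value=0.0)

clean = torch.rand(3, 64, 64)        # stand-in for a training image, CHW in [0,1]
noisy_input = cutout(clean.clone())  # input with a hole punched in it
target = clean                       # the autoencoder must "imagine" the erased patch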

dobkeratops commented 2 years ago

Right.. my thinking is that network dropout is an internal training tool, whilst the image augmentation is more deliberate.. e.g. I'm currently using my own noise algorithm. I suppose in theory having the symmetry, the exact same type of noise throughout, has its appeal.

I will also make the dropout amount a command-line parameter. There's no science as to why it's 0.25 there. I figured at 0.5 it might prevent feature cooperation too much. In my head this idea of feature vectors (rather than the sparse activations they pursue in SNNs, and all the other things we've talked about elsewhere) requires that the features do actually cooperate a bit. Just have to test and measure this.
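
A minimal sketch of exposing the dropout amount as a command-line flag, assuming argparse; the flag name and wiring are hypothetical:

import argparse

parser = argparse.ArgumentParser()
# Hypothetical flag; 0.25 mirrors the current hard-coded value.
parser.add_argument("--dropout", type=float, default=0.25,
                    help="dropout probability used inside the network")
args = parser.parse_args()

# then pass args.dropout into the model, e.g. nn.Dropout2d(p=args.dropout)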

Ultimately I suspect there's a pretrained net out there that will do this whole job better, but I'm using this to try ideas out and get a feel for all these techniques that I've read about.

I definitely want to make the input noise more configurable as well. I tried fractal-style noise as opposed to the usual per-pixel noise, but I notice it makes it blur trees out, because they look too much like the fractal noise, lol. It could even switch the input noise randomly.

I think bottlenecking the final layer more might also let it learn without noise. Again I'm trying to make it output multiple versions at different depths so it can explore to find just the right amount of bottleneck.

There's something else more serious I'm not happy with here, which is the way it does downsampling everywhere except the last layer. I'd rather each level be the exact same type of block, including the downsample/upsample. I already did one long training run (8 hours) with this problem.
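
A minimal sketch of what "every level is the exact same type of block, including the downsample/upsample" could look like; the block contents are assumptions, not the repo's actual layers:

import torch.nn as nn

def down_block(ch_in, ch_out):
    # One uniform encoder level: conv + activation + strided downsample, repeated per depth.
    return nn.Sequential(
        nn.Conv2d(ch_in, ch_out, 3, padding=1),
        nn.ReLU(inplace=True),
        nn.Conv2d(ch_out, ch_out, 3, stride=2, padding=1),  # downsample
    )

def up_block(ch_in, ch_out):
    # Matching decoder level: upsample + conv + activation.
    return nn.Sequential(
        nn.ConvTranspose2d(ch_in, ch_out, 4, stride=2, padding=1),  # upsample
        nn.Conv2d(ch_out, ch_out, 3, padding=1),
        nn.ReLU(inplace=True),
    )

channels = [3, 32, 64, 128]  # deepen later by just appending entries here
encoder = nn.Sequential(*[down_block(a, b) for a, b in zip(channels, channels[1:])])
decoder = nn.Sequential(*[up_block(b, a) for a, b in reversed(list(zip(channels, channels[1:])))])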

dobkeratops commented 2 years ago

I may also extend this dataloader to look for groups of images by name, e.g. (foo_PREV.jpg, foo_NEXT.jpg) -> foo_OUTPUT.jpg, (bar_PREV.jpg, bar_NEXT.jpg) -> bar_OUTPUT.jpg, etc., for a completely general way to train a net on multi-channel inputs and outputs. Without a dedicated video data loader, another tool could spit out a few frame pairs. (Of course a dedicated video dataloader would be better).
The main use of this will be learning multi-channel rendering effects.. for example, a neural full-screen lighting approximation? Feed it the z-buffer, incident light, normals, previous frame (from game-engine snapshots) plus a raytraced exact lighting solution. Neural shading. There are some papers on that. Also show it lit textures, with the output being incident light, normal map and other PBR channels… again I'm sure someone already did this.
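
A minimal sketch of such a name-grouping dataset, assuming PyTorch and the suffixes from the example; the directory layout and helper names are assumptions:

import os, glob
import torch
from PIL import Image
from torch.utils.data import Dataset
import torchvision.transforms.functional as TF

class PairedFrameDataset(Dataset):
    """Groups foo_PREV.jpg + foo_NEXT.jpg as input channels, foo_OUTPUT.jpg as target."""
    def __init__(self, root):
        self.stems = [p[:-len("_PREV.jpg")]
                      for p in glob.glob(os.path.join(root, "*_PREV.jpg"))]

    def __len__(self):
        return len(self.stems)

    def __getitem__(self, i):
        stem = self.stems[i]
        prev = TF.to_tensor(Image.open(stem + "_PREV.jpg").convert("RGB"))
        nxt  = TF.to_tensor(Image.open(stem + "_NEXT.jpg").convert("RGB"))
        out  = TF.to_tensor(Image.open(stem + "_OUTPUT.jpg").convert("RGB"))
        return torch.cat([prev, nxt], dim=0), out  # 6-channel input, 3-channel target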

Twenkid commented 2 years ago

Without a dedicated video data loader, another tool could spit out a few frame pairs. (Of course a dedicated video dataloader would be better).

I could develop the video loading part. Simple video loading is a few lines of code:

https://github.com/Twenkid/ComputerVision_Pyimagesearch_OpenCV_Dlib_OCR-Tesseract-DL/blob/master/pyimage/play.py

A simplified example:

import cv2

path = "video.mp4"  # a file path, a stream URL, or a camera index like 0
stream = cv2.VideoCapture(path)
while True:
    (grabbed, frame) = stream.read()  # grabbed is False at end of stream
    if not grabbed:
        break
    cv2.imshow("Frame", frame)
    cv2.waitKey(1)  # cv2 windowing needs a delay, otherwise the window is not updated and stays blank
stream.release()
cv2.destroyAllWindows()

I guess we'd store the frames in buffers, and it may run in a thread or a process in Python.
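A minimal sketch of that buffer-plus-thread idea, assuming OpenCV and a bounded queue; the names and sizes are hypothetical:

import cv2, threading, queue

frames = queue.Queue(maxsize=64)  # bounded buffer between grabber and trainer

def grab(path):
    stream = cv2.VideoCapture(path)
    while True:
        grabbed, frame = stream.read()
        if not grabbed:
            break
        frames.put(frame)  # blocks when the buffer is full
    stream.release()

threading.Thread(target=grab, args=("video.mp4",), daemon=True).start()
# consumer side: frame = frames.get()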

One example that may come in handy eventually, capturing multiple videos from YouTube at once with multithreading: https://github.com/Twenkid/Twenkid-FX-Studio/blob/master/Py/YoutubeAggregatorPafy/y6.py

E.g. it could run on clips from YouTube without downloading the videos or somebody manually feeding it. Also it could "watch" live streams.

dobkeratops commented 2 years ago

Using a process could be pretty interesting: open-ended plug-and-play dataloaders. And watching live streams = endless datasets. Nice idea. You could also randomly interleave multiple streams for hugely varied, endless input. Maybe choose between them based on which stream is showing the most change..
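
A minimal sketch of "choose the stream showing the most change", using mean absolute frame difference as an assumed change measure:

import cv2
import numpy as np

streams = [cv2.VideoCapture(p) for p in ("a.mp4", "b.mp4")]  # hypothetical sources
prev = [None] * len(streams)

def most_changed_frame():
    best, best_score = None, -1.0
    for i, s in enumerate(streams):
        grabbed, frame = s.read()
        if not grabbed:
            continue
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # score: mean absolute difference against this stream's previous frame
        score = 0.0 if prev[i] is None else float(np.mean(cv2.absdiff(gray, prev[i])))
        prev[i] = gray
        if score > best_score:
            best, best_score = frame, score
    return best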

Twenkid commented 2 years ago

Right.. my thinking is that network dropout is an internal training tool, whilst the image augmentation is more deliberate.. e.g. I'm currently using my own noise algorithm.

Yes, augmentation may be more focused and targeted at specific modifications, e.g. some expected variations of the input such as color shifts, contrast etc., while normal dropout is just "black" or noise etc.

Re YouTube streams: Pafy allows reading the metadata, a channel's video URLs, playlists etc.

Now that I was reading the augmentation literature, I realized again its function in a more general scope: it is like a choppy/discontinuous replacement for interactive learning, feeding gradually changing input etc. The "digested" version in the model "consumes" and generalises the correlations across all this "pseudo-interactive" learning.

Also, for other mechanisms, more symbolic ones, and/or eventually for SuperCogAlg, a set of augmented images with the same label suggests to the algorithm what's most important in the image. In CogAlg that would be the contours etc.

In general, a "smarter" model should take not only the pixel version of the augmented image, but also the operation itself as a parameter, and learn from that too.
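
One hedged way to read "take the operation as a parameter": feed the model an identifier of the augmentation alongside the pixels, e.g. via an embedding. A purely illustrative sketch; all names are hypothetical:

import torch
import torch.nn as nn

NUM_AUG_OPS = 8  # hypothetical: cutout, color shift, contrast, ...

class AugAwareEncoder(nn.Module):
    def __init__(self, feat_ch=64):
        super().__init__()
        self.conv = nn.Conv2d(3, feat_ch, 3, padding=1)
        self.op_embed = nn.Embedding(NUM_AUG_OPS, feat_ch)

    def forward(self, x, op_id):
        f = self.conv(x)
        # broadcast the augmentation embedding over the spatial dims
        return f + self.op_embed(op_id)[:, :, None, None]

enc = AugAwareEncoder()
y = enc(torch.rand(2, 3, 64, 64), torch.tensor([0, 3]))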

Twenkid commented 2 years ago

I will also make the dropout amount a command-line parameter. There's no science as to why it's 0.25 there. I figured at 0.5 it might prevent feature cooperation too much. In my head this idea of feature vectors (rather than the sparse activations they pursue in SNNs, and all the other things we've talked about elsewhere) requires that the features do actually cooperate a bit. Just have to test and measure this.

You mean in the sparse version they are more separated, while in the dense one they are intermixed and interacting/dependent? (Like in transformers etc.) I've noticed in models I've read that dropout is usually 0.1 or 0.2, but I don't know how they justify it logically beyond experimentally.

Ultimately I suspect there's a pretrained net out there that will do this whole job better, but I'm using this to try ideas out and get a feel for all these techniques that I've read about.

I don't know if it fits your exact goal, but they use, for example, the ImageNet weights of VGG-19 and fine-tune it for new tasks, such as style transfer or faster training for recognition. https://towardsdatascience.com/tensorflow-and-vgg19-can-help-you-convert-your-photos-into-beautiful-pop-art-pieces-c1abe87e7e01

An autoencoder project that uses VGG-16 (Keras): https://github.com/anikita/ImageNet_Pretrained_Autoencoder

I definitely want to make the input noise more configurable as well. I tried fractal-style noise as opposed to the usual per-pixel noise, but I notice it makes it blur trees out, because they look too much like the fractal noise, lol. It could even switch the input noise randomly.

BTW, mentioning what that noise looks like reminds me of an idea: model branching. Recognizers, visual transformers, maybe converting to "tokens"/codes; branching to smaller models, then integration or something.

I think bottlenecking the final layer more might also let it learn without noise. Again I'm trying to make it output multiple versions at different depths so it can explore to find just the right amount of bottleneck.

There's something else more serious I'm not happy with here, which is the way it does downsampling everywhere except the last layer. I'd rather each level be the exact same type of block, including the downsample/upsample. I already did one long training run (8 hours) with this problem.

You want each block to be the same? I guess a more interesting/original result could happen with some variations, residual blocks from somewhere etc.

Also if there are several models and encodings, some of their layers or embeddings could get concatenated etc. (like with the residual blocks, or with text-to-image etc.).

dobkeratops commented 2 years ago

You want each block to be the same? I guess a more interesting/original result could happen with some variations, residual blocks from somewhere etc.

Yes. One thing I did agree with from the CogAlg discussions, I think this idea appeared there.. "if every layer is being treated the same, you can extend indefinitely". The general idea is that these deep autoencoders should be extendable by stacking more layers; each week you could deepen and broaden it..

That's also in line with your notion of letting dropout do the straightforward input noise (seems worth trying..)

I don't know if it fits your exact goal, but they use, for example, the ImageNet weights of VGG-19 and fine-tune it for new tasks, such as style transfer or faster training for recognition.

I will definitely need to try that, I hear about that model a lot. Can a 'reconstructor' be retrofitted? If it's being used for style transfer, evidently "yes". I'll also continue to see how far I can get with my own smaller "homebrew" nets; maybe they will be enough for indie game upscaling, and maybe they'll be easier to squeeze into a Steam Deck in realtime. We could train nets for specific games. The ImageNet-winning entries tend to be very deep, and I'd worry a bit about the latency for in-game use. But perhaps it would still be better to start with these established nets and approximate them (try to keep feature vectors at certain layers compatible, such that they can be swapped back for higher-res versions later).

dobkeratops commented 2 years ago

(I've got some ideas to try out for net shape as well; this one is a variation on the 'densenet' idea, feature reuse, but at multiple scales: it would simultaneously combine input starting with the same image reduced to several 'mipmap levels'. No idea how well it would work yet.) [image: dense multiscale feature reuse sketch]
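
A rough sketch of how that could look, as I read the description: reduce the input to 'mipmap levels' with average pooling and concatenate each level with the downsampled features. The details are guesses:

import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseMultiscale(nn.Module):
    """Each level sees its features *and* the input image reduced to that scale."""
    def __init__(self, levels=3, ch=32):
        super().__init__()
        self.convs = nn.ModuleList(
            [nn.Conv2d(3, ch, 3, padding=1)] +
            [nn.Conv2d(ch + 3, ch, 3, padding=1) for _ in range(levels - 1)]
        )

    def forward(self, x):
        feat = self.convs[0](x)
        for conv in self.convs[1:]:
            feat = F.avg_pool2d(feat, 2)           # downsample the features
            x = F.avg_pool2d(x, 2)                 # matching 'mipmap level' of the input
            feat = conv(torch.cat([feat, x], 1))   # feature reuse across scales
        return feat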

dobkeratops commented 2 years ago

https://github.com/Twenkid/ComputerVision_Pyimagesearch_OpenCV_Dlib_OCR-Tesseract-DL/blob/master/pyimage/play.py E.g. it could run on clips from YouTube without downloading the videos or somebody manually feeding it. Also it could "watch" live streams.

Even without the idea of feeding in realtime, a tool which could watch a list of channels and fill a directory to a certain size with "the most varied batch of frames" could be very useful.

Twenkid commented 2 years ago

Even without the idea of feeding in realtime

Yes, I also realized that it couldn't train on each frame in real time... But there could be some recognition in real time, and recording could be turned on when something is detected etc.

Yes. One thing I did agree with from the CogAlg discussions, I think this idea appeared there.. "if every layer is being treated the same, you can extend indefinitely". The general idea is that these deep autoencoders should be extendable by stacking more layers; each week you could deepen and broaden it..

I also agree with the idea of having blocks/modules which are extendable, but it could be more general: a "block" (not a layer or a stack with a fixed architecture) created by a generative process and having additional parameters. A whole process/cycle/system being a block, and also variable. Yes, it should have a proper "interface" and the possibility to be extended.

In CogAlg, when the feedback is complete, the layer/module is supposed to generate the following one etc.

In our case, that process could be some representation beyond the concrete conv operations etc., which generates and adjusts them.

That's also in line with your notion of letting dropout do the straightforward input noise (seems worth trying..)

BTW, one difference I realized between dropout and adding noise to the whole image: dropout adds the noise to a portion of the regions in the image/layer, while the noise could cover the whole image, changing the low-level statistics of all pixels.
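
The distinction, made concrete (a sketch; p and the noise scale are arbitrary):

import torch
import torch.nn.functional as F

x = torch.rand(1, 3, 64, 64)

# Dropout: zeroes a random subset of elements, the survivors are rescaled by 1/(1-p).
x_dropout = F.dropout(x, p=0.25, training=True)

# Additive noise: perturbs *every* pixel, shifting the low-level statistics globally.
x_noisy = x + 0.1 * torch.randn_like(x)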

I will definitely need to try that, I hear about that model a lot. Can a 'reconstructor' be retrofitted? If it's being used for style transfer, evidently "yes". I'll also continue to see how far I can get with my own smaller "homebrew" nets; maybe they will be enough for indie game upscaling, and maybe they'll be easier to squeeze into a Steam Deck in realtime. We could train nets for specific games. The ImageNet-winning entries tend to be very deep, and I'd worry a bit about the latency for in-game use. But perhaps it would still be better to start with these established nets and approximate them (try to keep feature vectors at certain layers compatible, such that they can be swapped back for higher-res versions later).

Re the ImageNet aspect: yes, it's trained on images which are very different from a dataset of retro games, with their limited color palette, high contrast, blocky artifacts, more repetitive patterns/tiles etc.

I think one thing we need is to prepare some specific dataset(s). It's possible to pretrain some ImageNet-like model on such a dataset.

Also, re utilising the peculiarities of retro graphics and eventually working in tile-space and sprite-space, one idea:

  1. Ensembles of multiple smaller models, trained on patches.
  2. Recognizer/detector, which then applies a specific smaller model.
  3. Reconstruction: stitcher/mixer/...

That's more complex though; there are details to clarify, and it needs more experience with the libraries.

I'd worry a bit about the latency for in-game use. But perhaps it would still be better to start with these established nets and approximate them (try to keep feature vectors at certain layers compatible, such that they can be swapped back for higher-res versions later).

Right about both. We need to collect experience; I think real-time speed is not critical for now, and the compatibility would allow transfer learning.

Twenkid commented 2 years ago

I started running your autoencoder: [screenshots]

dobkeratops commented 2 years ago

Thanks for trying it out, great to see you managed to run it. I have some ideas to clean up how it saves; I want to streamline reloading the state to quit and resume training (make it name the files better, etc.).

dobkeratops commented 2 years ago

Ensembles of multiple smaller models, trained on patches. Recognizer/detector, which then applies a specific smaller model. Reconstruction: stitcher/mixer/...

  • Recognizer/detector, which then applies a specific smaller model

The 'branched net' idea; I've thought about this as well, and supposedly this gets called "Mixture of Experts". (Another example is splitting a person detector and a more specialised pose-estimation net.)

Writing low-level code, it should be possible to write convolutions that do this branching in place, within the layers.
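
A toy sketch of that Mixture-of-Experts branching: a small gating network softly picks between expert convolutions per image. Just an illustration of the idea, not the in-place low-level version:

import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    def __init__(self, ch=16, experts=4):
        super().__init__()
        self.experts = nn.ModuleList(nn.Conv2d(ch, ch, 3, padding=1) for _ in range(experts))
        self.gate = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                  nn.Linear(ch, experts), nn.Softmax(dim=1))

    def forward(self, x):
        w = self.gate(x)  # (batch, experts): soft routing weights per image
        out = torch.stack([e(x) for e in self.experts], dim=1)  # (batch, experts, ch, H, W)
        return (w[:, :, None, None, None] * out).sum(dim=1)

moe = TinyMoE()
y = moe(torch.rand(2, 16, 32, 32))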

I see it's pretty easy to get at these pretrained models.. I can look at trying to use their features as a universal 'bridge'. I'd like to try precalculating all the images turned into the final conv feature map (before the FC layers).. and train my own smaller nets to produce the same result.
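
A sketch of that precalculation step, assuming torchvision's pretrained VGG-16 (the weights string needs a recent torchvision); which layer counts as the "final conv feature map" is a judgement call:

import torch
import torchvision

# Pretrained feature extractor: everything before the fully connected layers.
vgg_features = torchvision.models.vgg16(weights="IMAGENET1K_V1").features.eval()

@torch.no_grad()
def precompute(images):          # images: (batch, 3, 224, 224), ImageNet-normalized
    return vgg_features(images)  # (batch, 512, 7, 7) feature maps to train a small net against

# distillation target: minimize e.g. MSE(small_net(images), precompute(images))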

I bet you can do a lot by just trying to compress those weights as well (e.g. vector-quantize the rows of its convolutions etc.).
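
A hedged sketch of vector-quantizing the rows: flatten each output filter to a row, cluster with k-means (scikit-learn here, for brevity), and replace each row by its nearest centroid:

import torch
from sklearn.cluster import KMeans

def vq_conv_weights(conv, codebook_size=64):
    # codebook_size must be <= the number of output filters
    w = conv.weight.detach()                  # (out_ch, in_ch, kH, kW)
    rows = w.reshape(w.shape[0], -1).numpy()  # one row per output filter
    km = KMeans(n_clusters=codebook_size, n_init=4).fit(rows)
    quantized = km.cluster_centers_[km.labels_]  # each row -> its nearest centroid
    conv.weight.data = torch.from_numpy(quantized).float().reshape(w.shape)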

Currently I'm envisaging in-game nets that are <1/10th this size? I must check what various machines can actually handle.

Twenkid commented 2 years ago

Thanks for trying it out, great to see you managed to run it. I have some ideas to clean up how it saves; I want to streamline reloading the state to quit and resume training (make it name the files better, etc.).

You're welcome. Yesterday I ran it only on the CPU because the GPU was busy, and I had to adjust the paths (Windows). Now that I tried the GPU, I had to make a little fix in train_epoch():

    for i,(data,target) in enumerate(dataloader):
        data = data.to(device)      # Tensor.to() is not in-place: it returns a copy on
        target = target.to(device)  # the target device, so the result must be re-assigned

It was just:

data.to(device)    # the moved tensors were discarded here,
target.to(device)  # so the originals stayed on the CPU

(If my version of the code was up to date).

On my GPU, the version without assigning the result back produces an error when trying to run on the CUDA device. It seems part of the data remains in system RAM while another part is on the GPU:

RuntimeError: Input type (torch.FloatTensor) and weight type (torch.cuda.FloatTensor) should be the same or input should be a MKLDNN tensor and weight is a dense tensor


Traceback (most recent call last):
  File "Z:\auto\autoencoder.py", line 545, in <module>
    main(sys.argv[1:])
  File "Z:\auto\autoencoder.py", line 537, in main
    train_epoch(device, ae,optimizer[0 if i<200 else 1],  dataloader,progress)
  File "Z:\auto\autoencoder.py", line 445, in train_epoch
    output=model(data)
  File "C:\ProgramData\Miniconda3\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "Z:\auto\autoencoder.py", line 127, in forward
    return self.eval_all(x)
  File "Z:\auto\autoencoder.py", line 116, in eval_all
    out=self.eval_unet(x)
  File "Z:\auto\autoencoder.py", line 230, in eval_unet
    x=self.activ( self.conv[i](x))
  File "C:\ProgramData\Miniconda3\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\ProgramData\Miniconda3\lib\site-packages\torch\nn\modules\conv.py", line 457, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "C:\ProgramData\Miniconda3\lib\site-packages\torch\nn\modules\conv.py", line 453, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: Input type (torch.FloatTensor) and weight type (torch.cuda.FloatTensor) should be the same or input should be a MKLDNN tensor and weight is a dense tensor

I guess on more modern GPUs there's better unified memory/addressing and maybe they "care less".

However maybe it could speed up the execution on your GPU as well?

A solution is explained here, although their error is not exactly the same:

https://stackoverflow.com/questions/59013109/runtimeerror-input-type-torch-floattensor-and-weight-type-torch-cuda-floatte

... In my fork (I need to upload it) I also added a counter to the progress_ images and turned off img.show(). I thought of making it non-blocking, but Pillow calls a system viewer etc., so I may change it to OpenCV. For now it just saves, and one could review each file. That should be adjusted though: it could be a circular list of the last N images, plus a separate persistent save at particular intervals etc. (see the sketch below).

If it's just one image, it can be seen just by opening with a system viewer and reloading it.
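
A minimal sketch of that circular-list idea; the naming scheme and intervals are hypothetical, and img is assumed to be a PIL image as in the repo:

import os
from collections import deque

MAX_RECENT = 10
recent = deque()  # paths of the most recent rolling progress images

def save_progress(img, step, keep_every=500):
    path = f"progress_{step}.png"  # hypothetical naming scheme
    img.save(path)
    if step % keep_every == 0:
        return                     # persistent snapshot: never deleted
    recent.append(path)
    if len(recent) > MAX_RECENT:
        os.remove(recent.popleft())  # drop the oldest rolling image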


dobkeratops commented 2 years ago

Thanks, right, that back-assignment is a bug. I can fix that as well, or merge a PR perhaps. I did ideally want to make the progress viewer a window that updates; haven't found how in Python yet. I tend to just use the web view. Haven't looked into "tensorboard" yet. I'll need to expand the way it shows multiple inputs and branch outputs, with an option to show more of the batch etc.

Twenkid commented 2 years ago

I added online display with opencv:

python -m pip install opencv-python
import cv2

In visualize_progress, instead of img.show():

display_image = numpy.array(img)                                # PIL image -> numpy array
display_image = cv2.cvtColor(display_image, cv2.COLOR_RGB2BGR)  # PIL is RGB, OpenCV expects BGR
cv2.imshow("PROGRESS", display_image)
cv2.waitKey(1)  # as in the earlier snippet: without a waitKey the window is never repainted

dobkeratops commented 2 years ago

https://github.com/dobkeratops/convnet_stuff

Damn, I remember what happened: I broke this repo (I'd tried to push a big export of several trained nets; GitHub stalled, and nothing I tried fixed it), so I pushed my latest code to this new repo. But I'll try to update this one as well now..

This version has the assignment fix; also, the web view here auto-updates (it makes an HTML page that displays the image; that page is set up to auto-refresh). My Linux distro happens to serve pages from /var/www/html by default.

That method to view via OpenCV seems nice too, more direct.

Twenkid commented 2 years ago

NP, it seems it was like a test: would I catch that bug, and would I conclude that the blocking display of the progress image should update automatically. :)