Refactor SIMD code and add support for SSE2 and NEON

Hi from a Martin to another!

I came to work on Opal since it's used in STAR, and I'm trying to improve the portability of some bioinformatics tools to new platforms (namely Aarch64 which is becoming more prevalent with the new Mac M1 chip). Even though it looks you're not in the field anymore, I hope you can find some time to review and merge this PR!

The current way SIMD was handled was not so compatible with NEON because some SSE/AVX intrinsics do not have strict equivalents in NEON, so I removed the macro-based aliasing and made all SIMD implementations in the Simd<T> templates. Doing so also let me add an emulation layer for SSE2 in the event SSE4.1 is not available. In theory, it should even be possible to add a "fake" SIMD implementation with just one lane to be used as a fallback on some other machines that way, but getting NEON support should already cover the vast majority of modern Arm machines.

Since the code was common between Simd<T> and SimdSw<T> (except for the signedness of min/max operations on char vectors), I merged the 6 different template implementations into just 4, and added the remaining SIMD operations like load/store/bitwise-and to the template as well.

Test

The NEON code was tested on the Raspberry Pi 4:

```console $ uname -a Linux chloroplast 5.10.0-17-arm64 #1 SMP Debian 5.10.136-1 (2022-08-13) aarch64 GNU/Linux $ ./test Starting Opal! Using ARM NEON! Time: 7.257904 Maximum: 573 Starting normal! Time: 10.553968 Maximum: 573 Times faster: 1.454134 $ ./opal_aligner -x1 ../test_data/query/O74807.fasta ../test_data/db/uniprot_sprot15.fasta Using SW alignment mode. Reading query fasta file... Read query sequence, 110 residues. Reading database fasta file... Read 15 database sequences, 4491 residues total. Whole database read: 15 database sequences, 4491 residues in total. Comparing query to database... Finished! #: (, ) (, ) #0: 155 (?, ?) (106, 161) #1: 178 (?, ?) (109, 229) #2: 169 (?, ?) (107, 314) #3: 119 (?, ?) (108, 145) #4: 152 (?, ?) (109, 437) #5: 68 (?, ?) (102, 58) #6: 143 (?, ?) (106, 214) #7: 151 (?, ?) (105, 181) #8: 182 (?, ?) (109, 304) #9: 94 (?, ?) (84, 74) #10: 126 (?, ?) (87, 126) #11: 169 (?, ?) (108, 398) #12: 138 (?, ?) (99, 239) #13: 181 (?, ?) (106, 927) #14: 108 (?, ?) (97, 84) Cpu time of searching: 0.01 GCUPS (giga cell updates per second): 0.07 ```

Martinsos / opal

Refactor SIMD code and add support for SSE2 and NEON #32

Test