aertslab / create_cisTarget_databases

Create cisTarget databases

The 'zsync' files for the databases might be incorrect. #49

Open NirvanaCh opened 6 months ago

NirvanaCh commented 6 months ago

I'm sorry for submitting an issue here. I tried to download these databases using zsync.

https://resources.aertslab.org/cistarget/databases/homo_sapiens/hg38/screen/mc_v10_clust/region_based/hg38_screen_v10_clust.regions_vs_motifs.scores.feather

Pay attention to the SHA-1 checksum.

$ sha1sum hg38_screen_v10_clust.regions_vs_motifs.scores.feather
57b58cbc57002e2b96f4b51d6a9fec0e831abd29  hg38_screen_v10_clust.regions_vs_motifs.scores.feather

$ wget https://resources.aertslab.org/cistarget/databases/homo_sapiens/hg38/screen/mc_v10_clust/region_based/hg38_screen_v10_clust.regions_vs_motifs.scores.feather.zsync
--2024-05-09 09:16:55--  https://resources.aertslab.org/cistarget/databases/homo_sapiens/hg38/screen/mc_v10_clust/region_based/hg38_screen_v10_clust.regions_vs_motifs.scores.feather.zsync
Resolving resources.aertslab.org (resources.aertslab.org)... 198.18.0.18
Connecting to resources.aertslab.org (resources.aertslab.org)|198.18.0.18|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 61006451 (58M)
Saving to: ‘hg38_screen_v10_clust.regions_vs_motifs.scores.feather.zsync’

hg38_screen_v10_clust.reg 100%[===================================>]  58.18M  11.1MB/s    in 7.2s

2024-05-09 09:17:03 (8.11 MB/s) - ‘hg38_screen_v10_clust.regions_vs_motifs.scores.feather.zsync’ saved [61006451/61006451]

$ head hg38_screen_v10_clust.regions_vs_motifs.scores.feather.zsync
Blocksize: 2048
Filename: hg38_screen_v10_clust.regions_vs_motifs.scores.feather
Hash-Lengths: 2,3,6
Length: 13882267648
MTime: Thu, 07 Jul 2022 14:31:02 +0000
SHA-1: 57b58cbc57002e2b96f4b51d6a9fec0e831abd29
URL: https://resources.aertslab.org/cistarget/databases/homo_sapiens/hg38/screen/mc_v10_clust/region_based/hg38_screen_v10_clust.regions_vs_motifs.scores.feather
zsync: 2.0.0-alpha-1

(binary data follows)

$ wget https://resources.aertslab.org/cistarget/databases/homo_sapiens/hg38/screen/mc_v10_clust/region_based/hg38_screen_v10_clust.regions_vs_motifs.scores.feather.sha1sum.txt
--2024-05-09 09:25:49--  https://resources.aertslab.org/cistarget/databases/homo_sapiens/hg38/screen/mc_v10_clust/region_based/hg38_screen_v10_clust.regions_vs_motifs.scores.feather.sha1sum.txt
Resolving resources.aertslab.org (resources.aertslab.org)... 198.18.0.18
Connecting to resources.aertslab.org (resources.aertslab.org)|198.18.0.18|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 97 [text/plain]
Saving to: ‘hg38_screen_v10_clust.regions_vs_motifs.scores.feather.sha1sum.txt’

hg38_screen_v10_clust.reg 100%[===================================>]      97  --.-KB/s    in 0s

2024-05-09 09:25:50 (76.9 MB/s) - ‘hg38_screen_v10_clust.regions_vs_motifs.scores.feather.sha1sum.txt’ saved [97/97]

$ cat hg38_screen_v10_clust.regions_vs_motifs.scores.feather.sha1sum.txt
07b5e527d2ed082e081e439e68dffa77b5f6129c  hg38_screen_v10_clust.regions_vs_motifs.scores.feather

As you can see, its SHA-1 value matches the one recorded in the 'zsync' file's header, but differs from the one recorded in 'sha1sum.txt'.
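
The same comparison can be scripted. A minimal sketch in Python, assuming the 'sha1sum.txt' file uses the `<hex digest>  <filename>` layout shown above; the helper names `sha1_of_file` and `expected_sha1` are made up for this illustration:

```python
import hashlib
from pathlib import Path


def sha1_of_file(path, chunk_size=1024 * 1024):
    """Return the hex SHA-1 of a file, reading it in chunks so that
    multi-gigabyte databases do not have to fit in memory."""
    digest = hashlib.sha1()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def expected_sha1(sha1sum_txt):
    """Parse the digest (first field) of a '<digest>  <filename>' line."""
    return Path(sha1sum_txt).read_text().split()[0].lower()
```

A download would then pass if `sha1_of_file("x.feather") == expected_sha1("x.feather.sha1sum.txt")`, where `x.feather` stands for either database file.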

I hope it's not my fault, as redownloading is a bit of a hassle.

NirvanaCh commented 6 months ago

I also downloaded the rankings database and see the same mismatch:

$ cat hg38_screen_v10_clust.regions_vs_motifs.rankings.feather.sha1sum.txt
1688a925f22d312769798258d990f13866bb4924  hg38_screen_v10_clust.regions_vs_motifs.rankings.feather

$ head hg38_screen_v10_clust.regions_vs_motifs.rankings.feather.zsync
Blocksize: 2048
Filename: hg38_screen_v10_clust.regions_vs_motifs.rankings.feather
Hash-Lengths: 2,3,6
Length: 35192956928
MTime: Thu, 07 Jul 2022 14:35:59 +0000
SHA-1: 95c823ee1e19f68ce0c82f79042cdc1007018ddb
URL: https://resources.aertslab.org/cistarget/databases/homo_sapiens/hg38/screen/mc_v10_clust/region_based/hg38_screen_v10_clust.regions_vs_motifs.rankings.feather
zsync: 2.0.0-alpha-1

(binary data follows)

$ sha1sum hg38_screen_v10_clust.regions_vs_motifs.rankings.feather
95c823ee1e19f68ce0c82f79042cdc1007018ddb  hg38_screen_v10_clust.regions_vs_motifs.rankings.feather

NirvanaCh commented 6 months ago

Loading the rankings database then failed with an error:

ValueError: "/m/tutor/database/hg38_screen_v10_clust.regions_vs_motifs.rankings.feather" is not a cisTarget Feather database in Feather v1 or v2 format.

The relevant check in ctxcore/ctdb.py:

......
def is_feather_v1_or_v2(feather_filename: Union[Path, str]) -> Optional[int]:
    """
    Check if the passed filename is a Feather v1 or v2 file.

    :param feather_filename: Feather v1 or v2 filename.
    :return: 1 (for Feather version 1), 2 (for Feather version 2) or None.
    """

    with open(feather_filename, "rb") as fh_feather:
        # Read first 6 and last 6 bytes to see if we have a Feather v2 file.
        fh_feather.seek(0, 0)
        feather_v2_magic_bytes_header = fh_feather.read(6)
        fh_feather.seek(-6, 2)
        feather_v2_magic_bytes_footer = fh_feather.read(6)

        if feather_v2_magic_bytes_header == feather_v2_magic_bytes_footer == b"ARROW1":
            # Feather v2 file.
            return 2

        # Read first 4 and last 4 bytes to see if we have a Feather v1 file.
        feather_v1_magic_bytes_header = feather_v2_magic_bytes_header[0:4]
        feather_v1_magic_bytes_footer = feather_v2_magic_bytes_footer[2:]

        if feather_v1_magic_bytes_header == feather_v1_magic_bytes_footer == b"FEA1":
            # Feather v1 file.
            return 1

    # Some other file format.
    return None
......

$ head -c 6 hg38_screen_v10_clust.regions_vs_motifs.*
==> hg38_screen_v10_clust.regions_vs_motifs.rankings.feather <==
ARROW1
==> hg38_screen_v10_clust.regions_vs_motifs.scores.feather <==
ARROW1

$ tail -c 6 hg38_screen_v10_clust.regions_vs_motifs.*
==> hg38_screen_v10_clust.regions_vs_motifs.rankings.feather <==
(non-printable binary data, not ARROW1)
==> hg38_screen_v10_clust.regions_vs_motifs.scores.feather <==
00176-
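
A header that matches while the footer does not is the pattern one would expect from a file whose tail is missing. A small sketch that reproduces the effect on a temporary file; `has_arrow_magic` and `demo` are hypothetical helpers, the first mirrors only the Feather v2 branch of the function quoted above, and the payload is dummy data rather than a real database:

```python
import os
import tempfile

MAGIC = b"ARROW1"


def has_arrow_magic(path):
    """Return (header_ok, footer_ok) for the 6-byte Feather v2 magic."""
    with open(path, "rb") as fh:
        header = fh.read(6)
        fh.seek(-6, os.SEEK_END)
        footer = fh.read(6)
    return header == MAGIC, footer == MAGIC


def demo():
    """Write a fake 'complete' file, then cut off its last 10 bytes."""
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(MAGIC + b"\x00" * 100 + MAGIC)
        path = tmp.name
    try:
        complete = has_arrow_magic(path)
        os.truncate(path, os.path.getsize(path) - 10)
        truncated = has_arrow_magic(path)
    finally:
        os.remove(path)
    return complete, truncated
```

Here `demo()` returns `((True, True), (True, False))`: the complete file passes both checks, while the shortened one keeps its header but loses its footer.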

NirvanaCh commented 6 months ago

The local file sizes are incorrect: both files are smaller than the Content-Length reported by the server.

$ stat hg38_screen_v10_clust.regions_vs_motifs.*.feather
  File: hg38_screen_v10_clust.regions_vs_motifs.rankings.feather
  Size: 35192956928     Blocks: 68736272   IO Block: 4096   regular file
Device: 807h/2055d      Inode: 18643438    Links: 1
Access: (0777/-rwxrwxrwx)  Uid: ( 1001/ charles)   Gid: ( 1001/ charles)
Access: 2024-05-09 10:10:29.183467890 +0800
Modify: 2022-07-07 14:35:59.000000000 +0800
Change: 2024-05-09 10:10:03.311709805 +0800
 Birth: 2024-05-08 21:57:40.146629410 +0800
  File: hg38_screen_v10_clust.regions_vs_motifs.scores.feather
  Size: 13882267648     Blocks: 27113824   IO Block: 4096   regular file
Device: 807h/2055d      Inode: 18643440    Links: 1
Access: (0777/-rwxrwxrwx)  Uid: ( 1001/ charles)   Gid: ( 1001/ charles)
Access: 2024-05-09 10:48:38.146833263 +0800
Modify: 2024-05-08 23:28:39.283831255 +0800
Change: 2024-05-09 10:10:03.311709805 +0800
 Birth: 2024-05-08 21:57:43.862621727 +0800

$ curl -I https://resources.aertslab.org/cistarget/databases/homo_sapiens/hg38/screen/mc_v10_clust/region_based/hg38_screen_v10_clust.regions_vs_motifs.rankings.feather
HTTP/1.1 200 OK
Date: Thu, 09 May 2024 03:52:22 GMT
Server: Apache/2.4.29 (Ubuntu)
Strict-Transport-Security: max-age=15768000
Last-Modified: Thu, 07 Jul 2022 14:35:59 GMT
ETag: "831a9eca2-5e338010f31c0"
Accept-Ranges: bytes
Content-Length: 35192958114
X-Frame-Options: sameorigin

$ curl -I https://resources.aertslab.org/cistarget/databases/homo_sapiens/hg38/screen/mc_v10_clust/region_based/hg38_screen_v10_clust.regions_vs_motifs.scores.feather
HTTP/1.1 200 OK
Date: Thu, 09 May 2024 03:56:51 GMT
Server: Apache/2.4.29 (Ubuntu)
Strict-Transport-Security: max-age=15768000
Last-Modified: Thu, 07 Jul 2022 14:31:02 GMT
ETag: "33b729822-5e337ef5b5580"
Accept-Ranges: bytes
Content-Length: 13882267682
X-Frame-Options: sameorigin

So the 'zsync' files are incorrect.
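
The numbers above can be checked with a few lines of arithmetic. The sketch below (variable names are only illustrative) subtracts the local sizes reported by `stat` from the `Content-Length` headers, and also tests whether the local sizes are whole multiples of the `Blocksize: 2048` given in the zsync headers:

```python
BLOCKSIZE = 2048  # "Blocksize" field of both .zsync headers

# (size on disk from `stat`, Content-Length from `curl -I`)
sizes = {
    "rankings": (35192956928, 35192958114),
    "scores": (13882267648, 13882267682),
}

report = {}
for name, (local, remote) in sizes.items():
    report[name] = {
        "missing_bytes": remote - local,
        "local_is_block_multiple": local % BLOCKSIZE == 0,
    }

for name, row in report.items():
    print(name, row)
```

The rankings file comes out 1186 bytes short and the scores file 34 bytes short, each less than one block, and both local sizes are exact multiples of 2048. That would be consistent with the final partial block being dropped, although this reading is an inference from the numbers and not something confirmed in this thread.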

NirvanaCh commented 6 months ago

I fixed it using 'curl -C -':

$ curl -C - -O https://resources.aertslab.org/cistarget/databases/homo_sapiens/hg38/screen/mc_v10_clust/region_based/hg38_screen_v10_clust.regions_vs_motifs.rankings.feather
** Resuming transfer from byte position 35192956928
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1186  100  1186    0     0    989      0  0:00:01  0:00:01 --:--:--   989

$ curl -C - -O https://resources.aertslab.org/cistarget/databases/homo_sapiens/hg38/screen/mc_v10_clust/region_based/hg38_screen_v10_clust.regions_vs_motifs.scores.feather
** Resuming transfer from byte position 13882267648
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    34  100    34    0     0     29      0  0:00:01  0:00:01 --:--:--    29

$ tail -c 6 hg38_screen_v10_clust.regions_vs_motifs.*feather
==> hg38_screen_v10_clust.regions_vs_motifs.rankings.feather <==
ARROW1
==> hg38_screen_v10_clust.regions_vs_motifs.scores.feather <==
ARROW1

It looks like it's working now.
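
`curl -C -` resumes by asking the server for the bytes after the current local size, which works here because the server advertises `Accept-Ranges: bytes`. A rough Python equivalent using only the standard library, as a sketch: `range_header` and `resume_download` are hypothetical helper names, and like `curl -C -` the code simply trusts that the bytes already on disk are correct, so running a checksum afterwards is still advisable.

```python
import os
import urllib.request


def range_header(local_size):
    """Build an open-ended HTTP Range header value starting at local_size."""
    return f"bytes={local_size}-"


def resume_download(url, path, chunk_size=1024 * 1024):
    """Append the remainder of `url` to `path`, starting at its current size."""
    local_size = os.path.getsize(path) if os.path.exists(path) else 0
    request = urllib.request.Request(
        url, headers={"Range": range_header(local_size)}
    )
    with urllib.request.urlopen(request) as response:
        # 206 means the server honoured the range; 200 means it sent the
        # whole file, which must not be appended to existing data.
        if local_size and response.status != 206:
            raise RuntimeError("server ignored the Range request; not appending")
        with open(path, "ab") as fh:
            while True:
                chunk = response.read(chunk_size)
                if not chunk:
                    break
                fh.write(chunk)
    return os.path.getsize(path)
```

For the rankings file above, the request would carry `Range: bytes=35192956928-`. If the local file is already complete, a server will typically answer with status 416, which `urlopen` raises as an `HTTPError`.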

To summarize: the 'zsync' files are incorrect.

Best wishes

ghuls commented 3 months ago

The zsync files have been removed for now, as zsync has had issues with big files (larger than 2G) for a long time.

It looks like the zsync2 bug (https://github.com/AppImageCommunity/zsync2/issues/31) might finally be resolved in a fork of zsync2: https://github.com/NiLuJe/zsync2/commit/a8e2d68e3f03315835f6d6fb9f74a26c3ea000b9