Closed magicDGS closed 6 years ago
The future format may be something like "extension [-|gzip] offset magic_values". Extending the signature format will break the existing signatures...
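As a rough illustration of the "extension [-|gzip] offset magic_values" idea, here is a minimal C sketch that splits one such line into its fields. The struct, the function name, and the exact layout are assumptions made for this example; none of them exist in PhotoRec, and the magic field is kept verbatim rather than decoded.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical record for one line of the proposed format:
 *   extension [-|gzip] offset magic_values
 * These names are illustrative only and are not part of PhotoRec. */
struct compressed_sig {
  char extension[16];
  int is_gzip;          /* 1 if the second field is "gzip", 0 if it is "-" */
  unsigned int offset;  /* offset of the magic (in the decompressed data if is_gzip) */
  char magic[64];       /* magic_values field, copied verbatim (not decoded) */
};

/* Returns 0 on success, -1 if the line does not match the assumed layout. */
static int parse_compressed_sig(const char *line, struct compressed_sig *sig)
{
  char compression[8];
  int consumed = 0;
  size_t magic_len;

  if (sscanf(line, "%15s %7s %u %n",
             sig->extension, compression, &sig->offset, &consumed) != 3)
    return -1;
  if (strcmp(compression, "gzip") == 0)
    sig->is_gzip = 1;
  else if (strcmp(compression, "-") == 0)
    sig->is_gzip = 0;
  else
    return -1;                       /* unknown compression keyword */
  if (consumed <= 0 || line[consumed] == '\0')
    return -1;                       /* magic_values field is missing */
  magic_len = strcspn(line + consumed, "\r\n");
  if (magic_len == 0 || magic_len >= sizeof(sig->magic))
    return -1;
  memcpy(sig->magic, line + consumed, magic_len);
  sig->magic[magic_len] = '\0';
  return 0;
}
```

With this layout, an uncompressed signature would use "-" as its second field, and any other compression keyword is rejected until support for it is added.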
Currently, you have to modify file_gz.c directly, for example:
diff -ruw testdisk/src/file_gz.c ../testdisk-7.1-WIP/src/file_gz.c
--- testdisk/src/file_gz.c 2018-03-22 13:20:13.628017471 +0100
+++ ../testdisk-7.1-WIP/src/file_gz.c 2018-04-10 11:22:25.121483486 +0200
@@ -177,6 +177,12 @@
file_recovery_new->min_filesize=22;
file_recovery_new->time=le32(gz->mtime);
file_recovery_new->file_rename=&file_rename_gz;
+ if(memcmp(buffer_uncompr, "BAM\1", 4)==0)
+ {
+ /* https://github.com/samtools/hts-specs SAM/BAM and related high-throughput sequencing file formats */
+ file_recovery_new->extension="bam";
+ return 1;
+ }
if(memcmp(buffer_uncompr, "PVP ", 4)==0)
{
/* php Video Pro */
Thanks @cgsecurity. To keep the existing signature files working without breaking changes, a separate photorec.compress.sig might be an option for adding compressed formats (the second field would name a valid compression, such as gzip, and maybe later others such as bz2). This is quite important for bioinformatics, where most formats are compressed.
In addition, it looks like PhotoRec pulls out BAM files as separate gzip files. Would it be possible to detect when several gzip streams are concatenated into a block-compressed file (bgzip)?
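For reference, a bgzip (BGZF) block is a gzip member whose extra field carries a 'B','C' subfield of length 2 holding the total block size minus one, as described in the SAM/BAM specification. The sketch below, with function names invented for this example, reads that size and measures how many bytes of a buffer are covered by back-to-back BGZF blocks. It only illustrates the idea and is not how PhotoRec implements it.

```c
#include <stddef.h>

/* Size in bytes of the BGZF block starting at buf, or 0 if buf does not
 * start with a BGZF block header. Hypothetical helper for illustration. */
static size_t bgzf_block_size(const unsigned char *buf, size_t len)
{
  size_t xlen, pos, end;

  if (len < 18)                         /* 12-byte header + 6-byte BC subfield */
    return 0;
  if (buf[0] != 0x1f || buf[1] != 0x8b || buf[2] != 0x08)
    return 0;                           /* not gzip/deflate */
  if ((buf[3] & 0x04) == 0)
    return 0;                           /* FEXTRA flag not set */
  xlen = (size_t)buf[10] | ((size_t)buf[11] << 8);
  pos = 12;
  end = 12 + xlen;
  if (end > len)
    return 0;
  /* Walk the extra subfields looking for SI1='B', SI2='C', SLEN=2. */
  while (pos + 4 <= end) {
    size_t slen = (size_t)buf[pos + 2] | ((size_t)buf[pos + 3] << 8);
    if (buf[pos] == 'B' && buf[pos + 1] == 'C' && slen == 2) {
      size_t bsize;
      if (pos + 6 > end)
        return 0;
      bsize = ((size_t)buf[pos + 4] | ((size_t)buf[pos + 5] << 8)) + 1;
      if (bsize < end + 8)              /* must hold header and 8-byte trailer */
        return 0;
      return bsize;
    }
    pos += 4 + slen;
  }
  return 0;
}

/* Number of bytes at the start of buf covered by contiguous, complete
 * BGZF blocks. */
static size_t bgzf_run_length(const unsigned char *buf, size_t len)
{
  size_t total = 0;

  for (;;) {
    size_t bs = bgzf_block_size(buf + total, len - total);
    if (bs == 0 || bs > len - total)
      break;
    total += bs;
  }
  return total;
}
```

A carver could keep appending data to the same recovered file for as long as each gzip header it meets is a BGZF block header, instead of starting a new file at every gzip magic.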
Here there are some magic patterns for bioinformatics: https://github.com/lindenb/magic/tree/master/patterns
Maybe that could be implemented in photorec (or an extension of it) to recover bioinformatics data...
Do you mean that PhotoRec recovers a single BAM file as several gzip files? If that is the case, please try to reproduce the problem with photorec /d recup_dir /cmd sample.bam search. If you get several gz files, please share the sample file.
Using this BAM (https://github.com/broadinstitute/gatk/blob/master/src/test/resources/large/CEUTrio.HiSeq.WGS.b37.NA12878.20.21.bam) produces the following result with your command (without the fix proposed in your previous comment):
-rw-r--r-- 1 daniel staff 15M Apr 10 12:31 f0000000.gz
-rw-r--r-- 1 daniel staff 4.1M Apr 10 12:31 f0030938.gz
-rw-r--r-- 1 daniel staff 4.4M Apr 10 12:31 f0039252.gz
-rw-r--r-- 1 daniel staff 39M Apr 10 12:31 f0048275.gz
-rw-r--r-- 1 daniel staff 1.4M Apr 10 12:31 f0128504.gz
-rw-r--r-- 1 daniel staff 6.2M Apr 10 12:31 f0131315.gz
-rw-r--r-- 1 daniel staff 5.1M Apr 10 12:31 f0143990.gz
-rw-r--r-- 1 daniel staff 815K Apr 10 12:31 f0154341.gz
-rw-r--r-- 1 daniel staff 3.1K Apr 10 12:31 report.xml
With the fix for BAM files, the only difference is that f0000000.gz is renamed to f0000000.bam.
While recovering some real data, I realized that the problem is not limited to the BAM format: it affects any file compressed with bgzip.
I have started working on handling bgzip:
--- testdisk/src/file_gz.c 2018-03-22 13:20:13.628017471 +0100
+++ testdisk-7.1-WIP/src/file_gz.c 2018-04-10 14:00:00.000000000 +0200
@@ -36,7 +36,6 @@
#include "file_gz.h"
static void register_header_check_gz(file_stat_t *file_stat);
-static int header_check_gz(const unsigned char *buffer, const unsigned int buffer_size, const unsigned int safe_header_only, const file_recovery_t *file_recovery, file_recovery_t *file_recovery_new);
static void file_rename_gz(file_recovery_t *file_recovery);
extern const file_hint_t file_hint_doc;
@@ -59,7 +58,6 @@
uint8_t os;
} __attribute__ ((gcc_struct, __packed__));
-static const unsigned char gz_header_magic[3]= {0x1F, 0x8B, 0x08};
/* flags:
bit 0 FTEXT
bit 1 FHCRC
@@ -76,9 +74,38 @@
#define GZ_FNAME 8
#define GZ_FCOMMENT 0x10
-static void register_header_check_gz(file_stat_t *file_stat)
+static void file_check_bgzf(file_recovery_t *file_recovery)
{
- register_header_check(0, gz_header_magic,sizeof(gz_header_magic), &header_check_gz, file_stat);
+}
+
+static int header_check_bgzf(const unsigned char *buffer, const unsigned char *buffer_uncompr, const unsigned int buffer_size, file_recovery_t *file_recovery_new)
+{
+ const struct gzip_header *gz=(const struct gzip_header *)buffer;
+ reset_file_recovery(file_recovery_new);
+ file_recovery_new->min_filesize=22;
+ file_recovery_new->time=le32(gz->mtime);
+ file_recovery_new->file_rename=&file_rename_gz;
+ file_recovery_new->file_check=&file_check_bgzf;
+ if(memcmp(buffer_uncompr, "BAI\1", 4)==0)
+ {
+ /* https://github.com/samtools/hts-specs SAM/BAM and related high-throughput sequencing file formats */
+ file_recovery_new->extension="bai";
+ return 1;
+ }
+ if(memcmp(buffer_uncompr, "BAM\1", 4)==0)
+ {
+ /* https://github.com/samtools/hts-specs SAM/BAM and related high-throughput sequencing file formats */
+ file_recovery_new->extension="bam";
+ return 1;
+ }
+ if(memcmp(buffer_uncompr, "CSI\1", 4)==0)
+ {
+ /* https://github.com/samtools/hts-specs SAM/BAM and related high-throughput sequencing file formats */
+ file_recovery_new->extension="csi";
+ return 1;
+ }
+ file_recovery_new->extension=file_hint_gz.extension;
+ return 1;
}
static int header_check_gz(const unsigned char *buffer, const unsigned int buffer_size, const unsigned int safe_header_only, const file_recovery_t *file_recovery, file_recovery_t *file_recovery_new)
@@ -86,6 +113,7 @@
unsigned int off=10;
const unsigned int flags=buffer[3];
const struct gzip_header *gz=(const struct gzip_header *)buffer;
+ int bgzf=0;
/* gzip file format:
* a 10-byte header, containing a magic number, a version number and a timestamp
* optional extra headers, such as the original file name,
@@ -106,6 +134,8 @@
{
off+=2;
off+=buffer[10]|(buffer[11]<<8);
+ if(buffer[12]=='B' && buffer[13]=='C' && buffer[14]==2 && buffer[15]==0)
+ bgzf=1;
}
if((flags&GZ_FNAME)!=0)
{
@@ -133,6 +163,11 @@
if(header_ignored_adv(file_recovery, file_recovery_new)==0)
return 0;
}
+ if(file_recovery->file_check==&file_check_bgzf)
+ {
+ header_ignored(file_recovery_new);
+ return 0;
+ }
#if defined(HAVE_ZLIB_H) && defined(HAVE_LIBZ)
{
static const unsigned char schematic_header[12]={ 0x0a, 0x00, 0x09,
@@ -173,6 +208,10 @@
if(d_stream.total_out < 16)
return 0;
buffer_uncompr[d_stream.total_out]='\0';
+ if(bgzf!=0)
+ {
+ return header_check_bgzf(buffer, buffer_uncompr, d_stream.total_out, file_recovery_new);
+ }
reset_file_recovery(file_recovery_new);
file_recovery_new->min_filesize=22;
file_recovery_new->time=le32(gz->mtime);
@@ -291,3 +330,9 @@
return "none";
#endif
}
+
+static void register_header_check_gz(file_stat_t *file_stat)
+{
+ static const unsigned char gz_header_magic[3]= {0x1F, 0x8B, 0x08};
+ register_header_check(0, gz_header_magic,sizeof(gz_header_magic), &header_check_gz, file_stat);
+}
I don't know if bai and csi files are also compressed. Can you build on this patch?
Many thanks for looking into this. I will check your patch as soon as possible (I haven't looked at the code yet, because this is a quick answer). Is the patch from your previous comment committed to some repository (either a branch here or in the official one)? That would be useful for testing your changes.
In the meantime, some hints about bioinformatics formats to help you work on it:
- Use the bgz extension if the file is detected to be block-compressed and is not one of the supported BAM-related files.
- Index files can carry several extensions (bam.bai and bai; bam.csi, csi, and even cram.csi), but I recommend sticking to the simpler ones (bai, csi) to be on the safe side (a csi index is not necessarily associated with a BAM file).
For your information, and to make your progress easier, I implemented some of the common bioinformatics formats that might be compressed in my fork (see the branch https://github.com/magicDGS/testdisk/tree/dgs_bioinf_files). That can serve as a reference for your fix.
Using the patch, I re-ran the command on the CEUTrio.HiSeq.WGS.b37.NA12878.20.21.bam file, and it did not create any file, just the following report.xml:
<?xml version='1.0' encoding='UTF-8'?>
<dfxml xmloutputversion='1.0'>
<metadata
xmlns='http://www.forensicswiki.org/wiki/Category:Digital_Forensics_XML'
xmlns:xsi='http://www.w3.org/2001/XMLSchema-instance'
xmlns:dc='http://purl.org/dc/elements/1.1/'>
<dc:type>Carve Report</dc:type>
</metadata>
<creator>
<package>PhotoRec</package>
<version>7.1-WIP</version>
<build_environment>
<compiler>GCC 4.2</compiler>
<library name='libext2fs' version='none'/>
<library name='libewf' version='none'/>
<library name='libjpeg' version='none'/>
<library name='libntfs' version='none'/>
<library name='zlib' version='1.2.5'/>
</build_environment>
<execution_environment>
<os_sysname>Darwin</os_sysname>
<os_release>15.6.0</os_release>
<os_version>Darwin Kernel Version 15.6.0: Tue Jan 9 20:12:05 PST 2018; root:xnu-3248.73.5~1/RELEASE_X86_64</os_version>
<host>i122mc132.vu-wien.ac.at</host>
<arch>x86_64</arch>
<uid>502</uid>
<start_time>2018-04-10T16:12:50+0200</start_time>
</execution_environment>
</creator>
<source>
<image_filename>/Users/daniel/workspaces/gatk_magicdgs/src/test/resources/large/CEUTrio.HiSeq.WGS.b37.NA12878.20.21.bam</image_filename>
<sectorsize>512</sectorsize>
<image_size>79856849</image_size>
<volume>
<byte_runs>
<byte_run offset='0' img_offset='0' len='79856849'/>
</byte_runs>
</volume>
</source>
<configuration>
</configuration>
</dfxml>
Can you check the dev branch ?
Sorry for my previous comment. I am running photorec on the same computer for a failing disk with gzip files disabled, and the run that I showed picked up that configuration. Is it possible to pass a different configuration to a different photorec run?
Once I figure out how to run it without killing the process, I will check the dev branch. Thanks!
OK, I found the way, using photorec /d recup_dir /cmd CEUTrio.HiSeq.WGS.b37.NA12878.20.21.bam fileopt,everything,enable,search. The tests that I did:
-rw-r--r-- 1 daniel staff 15M Apr 11 11:38 f0000000.gz
-rw-r--r-- 1 daniel staff 4.1M Apr 11 11:38 f0030938.gz
-rw-r--r-- 1 daniel staff 4.4M Apr 11 11:38 f0039252.gz
-rw-r--r-- 1 daniel staff 39M Apr 11 11:38 f0048275.gz
-rw-r--r-- 1 daniel staff 1.4M Apr 11 11:38 f0128504.gz
-rw-r--r-- 1 daniel staff 6.2M Apr 11 11:38 f0131315.gz
-rw-r--r-- 1 daniel staff 5.1M Apr 11 11:38 f0143990.gz
-rw-r--r-- 1 daniel staff 815K Apr 11 11:38 f0154341.gz
-rw-r--r-- 1 daniel staff 3.0K Apr 11 11:38 report.xml
-rw-r--r-- 1 daniel staff 76M Apr 11 11:39 f0000000.bam
-rw-r--r-- 1 daniel staff 1.6K Apr 11 11:39 report.xml
-rw-r--r-- 1 daniel staff 76M Apr 11 11:46 f0000000.bam
-rw-r--r-- 1 daniel staff 1.6K Apr 11 11:46 report.xml
And the MD5 is the same for all the files: 1cb7aa6facf25bb759b1e1d00dd19a3d
To test whether a bgzip-compressed non-BAM file is also split, I took the BAM file, converted it to plain text using samtools, and compressed it with bgzip. The command was samtools view -h CEUTrio.HiSeq.WGS.b37.NA12878.20.21.bam | bgzip -c > CEUTrio.HiSeq.WGS.b37.NA12878.20.21.sam.bgz.
In this case, the result was:
-rw-r--r-- 1 daniel staff 6.3M Apr 11 11:49 f0000000.gz
-rw-r--r-- 1 daniel staff 16M Apr 11 11:49 f0012983.gz
-rw-r--r-- 1 daniel staff 4.9M Apr 11 11:49 f0045914.gz
-rw-r--r-- 1 daniel staff 7.6M Apr 11 11:49 f0055896.gz
-rw-r--r-- 1 daniel staff 5.1M Apr 11 11:49 f0071394.gz
-rw-r--r-- 1 daniel staff 556K Apr 11 11:49 f0081749.gz
-rw-r--r-- 1 daniel staff 11M Apr 11 11:49 f0082860.gz
-rw-r--r-- 1 daniel staff 5.6M Apr 11 11:49 f0106065.gz
-rw-r--r-- 1 daniel staff 1.4M Apr 11 11:49 f0117549.gz
-rw-r--r-- 1 daniel staff 9.7M Apr 11 11:49 f0120481.gz
-rw-r--r-- 1 daniel staff 883K Apr 11 11:49 f0140341.gz
-rw-r--r-- 1 daniel staff 216K Apr 11 11:49 f0142107.gz
-rw-r--r-- 1 daniel staff 1.9M Apr 11 11:49 f0142539.gz
-rw-r--r-- 1 daniel staff 4.0K Apr 11 11:49 report.xml
-rw-r--r-- 1 daniel staff 72M Apr 11 11:49 f0000000.gz
-rw-r--r-- 1 daniel staff 1.6K Apr 11 11:49 report.xml
So it also works for block-compressed files whose format is not one of the known ones (in your branch, BAM/BAI/CSI). Nevertheless, I recommend using the bgz extension in this case, to show that the file was detected as block-compressed. That will make recovery easier, because knowing whether a file was detected as bgzip or gzip can help identify it (e.g., some bioinformatics formats, such as FASTA, are compressed with gzip, while other formats should always be compressed with bgzip).
Thanks a lot for the work done here.
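The naming rule discussed here can be sketched as a small lookup table: once a file has been identified as block-compressed, the first decompressed bytes select a specific extension, and anything unrecognized falls back to bgz. The table and function name are made up for this example; the magic strings are the ones that appear in the patch above.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical mapping from a 4-byte magic in the decompressed data to a
 * file extension; the magics are those used in the patch in this thread. */
struct magic_ext {
  const char *magic;
  const char *extension;
};

/* Extension for a file already known to be BGZF, given the start of its
 * decompressed content. Unknown content is reported as "bgz". */
static const char *bgzf_extension(const unsigned char *uncompr, size_t len)
{
  static const struct magic_ext known[] = {
    { "BAI\1", "bai" },
    { "BAM\1", "bam" },
    { "CSI\1", "csi" },
  };
  size_t i;

  if (len >= 4) {
    for (i = 0; i < sizeof(known) / sizeof(known[0]); i++) {
      if (memcmp(uncompr, known[i].magic, 4) == 0)
        return known[i].extension;
    }
  }
  return "bgz";
}
```

Keeping the fallback distinct from gz preserves the information that the file was block-compressed, which is the point made above.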
Can you change https://github.com/cgsecurity/testdisk/commit/cdde95797ee04b258a7fd29fd4ebcb69f32da74b#diff-bcb8aa815b8b17b77dac79ecc7656e8eR107 to set the extension to bgz? I think that will be perfect (I tested that normal gzipped files are set to gz).
Thanks a lot for the help and the quick fix!
Done, I have modified the extension in the dev branch.
Thank you! This software is awesome, and your commitment to supporting new formats is great!
Is there any plan to include the changes in the next release? And to make a new release soon?
I have uploaded a new 7.1-WIP (source + binaries) with those changes.
Thanks!
7.1 will succeed 7.1-WIP when it is released.
I am trying to add custom formats to photorec.sig for common bioinformatics files (see http://samtools.github.io/hts-specs/ for more information on some of them), but it is quite difficult because some of the formats are compressed with bgzip (an extension of gzip based on blocks).
It looks like some compressed signatures are implemented directly in the .gz format (e.g., xml.gz), and thus a new compressed format can be added by modifying https://github.com/cgsecurity/testdisk/blob/master/src/file_gz.c. Nevertheless, this does not allow identifying arbitrary compressed files (with .gz or another algorithm) and thus is difficult to extend. In addition, it might be difficult for users to create a photorec.sig entry for a compressed extension where the signature is a string.
It would be nice if photorec.sig had some kind of mechanism to indicate compressed signatures with certain algorithms, or if compressed formats were easier to extend (e.g., a boolean field in the file_hint_t struct to indicate that the data may first need to be decompressed, with the signature identified after decompression).
Thanks in advance!