ComparativeGenomicsToolkit / cactus

Official home of genome aligner based upon notion of Cactus graphs

gbz file for vg rna #1038

Open xuxingyubio opened 1 year ago

xuxingyubio commented 1 year ago

I'm very sorry to bother you.

Recently, I chose alt contigs on a chromosome to construct a pangenome, similar to constructing the GRCh38 Alts Graph.

When I used a GBZ file, gencode.v38.primary.gff, and alt.annotation.gff3 as input, the following error occurred: [vg rna] Adding transcript splice-junctions and exon boundaries to graph ... ERROR: Chromomsome path "chr1" not found in graph or haplotypes index (line 8).

I found that the GFA file I obtained does not have paths, only walks. I don't know if this is related to the error.

How can I deal with this problem?

glennhickey commented 1 year ago

You can see the paths in your graph by using vg paths -Lx <graph.gbz>. All walks in your GFA will show up as paths in the graph.

There must be a naming difference between your graph and your GFF that is behind this error. Hopefully the output of vg paths will help determine it.
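One way to spot the mismatch (a rough sketch; graph.gbz and annotation.gff3 are placeholders for your own files) is to dump the two name sets and compare them:

    # Path names in the graph (GFA walks show up here as paths).
    vg paths -Lx graph.gbz | sort > graph_paths.txt

    # Sequence names used in column 1 of the GFF annotation.
    grep -v '^#' annotation.gff3 | cut -f1 | sort -u > gff_names.txt

    # Names that the GFF uses but the graph does not contain.
    comm -13 graph_paths.txt gff_names.txt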

xuxingyubio commented 1 year ago

The output of vg paths is strange. The format of the output is like this: SAMPLE#HAPLOTYPE#CONTIG#0. I don't know if this is reasonable, or whether it is related to how I built the pangenome. I used 'cactus-pangenome' directly.

Now I am trying to construct the HPRC Graph. When I run the command 'cactus-preprocess ./js_split grch38_pan_new.seqfile ./hprc-pg/grch38_MHC_pan_new.seqfile --configFile ./config_cut_hash.xml --realTimeLogging --brnnCores 24 --logFile grch38_MHC_pan_new.pp.log', it reports that it is running out of disk space. When I set TMPDIR so that intermediate files are stored elsewhere, I get this error:

Got exit code 1 (indicating failure) from job _toil_worker MergeChunks file:/home/users/xyxu/panpipeline/pansplit/js_split kind-MergeChunks/instance-svlvfnck.
Job failed with exit value 1: 'MergeChunks' kind-MergeChunks/instance-svlvfnck v1
Exit reason: None
The job seems to have left a log file, indicating failure: 'MergeChunks2' kind-MergeChunks/instance-svlvfnck v5
Log from job "kind-MergeChunks/instance-svlvfnck" follows:
=========>
    [2023-06-04T17:52:50+0800] [MainThread] [I] [toil.worker] ---TOIL WORKER OUTPUT LOG---
    [2023-06-04T17:52:50+0800] [MainThread] [I] [toil] Running Toil version 5.9.2-54bfe0b146b76ecc6221de384c255e1be89547c6 on host node13.
    [2023-06-04T17:52:50+0800] [MainThread] [I] [toil.worker] Working on job 'MergeChunks' kind-MergeChunks/instance-svlvfnck v1
    [2023-06-04T17:52:56+0800] [MainThread] [I] [toil.worker] Loaded body Job('MergeChunks' kind-MergeChunks/instance-svlvfnck v1) from description 'MergeChunks' kind-MergeChunks/instance-svlvfnck v1
    [2023-06-04T17:52:56+0800] [MainThread] [W] [toil.job] Preemptable as a keyword has been deprecated, please use preemptible.
    [2023-06-04T17:52:56+0800] [MainThread] [I] [toil.job] Saving graph of 2 jobs, 1 new
    [2023-06-04T17:52:56+0800] [MainThread] [I] [toil.job] Processing job 'MergeChunks2' kind-MergeChunks2/instance-jum4sr2h v0
    [2023-06-04T17:52:56+0800] [MainThread] [I] [toil.job] Processing job 'MergeChunks' kind-MergeChunks/instance-svlvfnck v1
    [2023-06-04T17:52:56+0800] [MainThread] [I] [toil.worker] Completed body for 'MergeChunks' kind-MergeChunks/instance-svlvfnck v3
    [2023-06-04T17:52:56+0800] [MainThread] [I] [toil.worker] Chaining from 'MergeChunks' kind-MergeChunks/instance-svlvfnck v3 to 'MergeChunks2' kind-MergeChunks2/instance-jum4sr2h v1
    [2023-06-04T17:52:56+0800] [MainThread] [I] [toil.worker] Working on job 'MergeChunks2' kind-MergeChunks/instance-svlvfnck v3
    [2023-06-04T17:52:56+0800] [MainThread] [I] [toil.worker] Loaded body Job('MergeChunks2' kind-MergeChunks/instance-svlvfnck v3) from description 'MergeChunks2' kind-MergeChunks/instance-svlvfnck v3
    Traceback (most recent call last):
      File "/home/users/xyxu/pantools/cactus-bin-v2.5.2/cactus_env/lib/python3.9/site-packages/toil/jobStores/fileJobStore.py", line 533, in read_file
        os.link(jobStoreFilePath, local_path)
    FileExistsError: [Errno 17] File exists: '/home/users/xyxu/panpipeline/pansplit/js_split/files/for-job/kind-LastzRepeatMaskJob/instance-8i3_p7_9/file-9180904b15d8423abafc7bebf4236a86/HG02055.1_0.maskedQeury' -> '/home/users/xyxu/tmp/d397bf4f9f945b43a0ee6743fa184d9b/17a4/a59d/tmpzi9jdxca.tmp'

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "/home/users/xyxu/pantools/cactus-bin-v2.5.2/cactus_env/lib/python3.9/site-packages/toil/worker.py", line 403, in workerScript
        job._runner(jobGraph=None, jobStore=jobStore, fileStore=fileStore, defer=defer)
      File "/home/users/xyxu/pantools/cactus-bin-v2.5.2/cactus_env/lib/python3.9/site-packages/cactus/shared/common.py", line 912, in _runner
        super(RoundedJob, self)._runner(*args, jobStore=jobStore,
      File "/home/users/xyxu/pantools/cactus-bin-v2.5.2/cactus_env/lib/python3.9/site-packages/toil/job.py", line 2743, in _runner
        returnValues = self._run(jobGraph=None, fileStore=fileStore)
      File "/home/users/xyxu/pantools/cactus-bin-v2.5.2/cactus_env/lib/python3.9/site-packages/toil/job.py", line 2660, in _run
        return self.run(fileStore)
      File "/home/users/xyxu/pantools/cactus-bin-v2.5.2/cactus_env/lib/python3.9/site-packages/cactus/preprocessor/cactus_preprocessor.py", line 117, in run
        chunkList = [readGlobalFileWithoutCache(fileStore, fileID) for fileID in self.chunkIDList]
      File "/home/users/xyxu/pantools/cactus-bin-v2.5.2/cactus_env/lib/python3.9/site-packages/cactus/preprocessor/cactus_preprocessor.py", line 117, in <listcomp>
        chunkList = [readGlobalFileWithoutCache(fileStore, fileID) for fileID in self.chunkIDList]
      File "/home/users/xyxu/pantools/cactus-bin-v2.5.2/cactus_env/lib/python3.9/site-packages/cactus/shared/common.py", line 922, in readGlobalFileWithoutCache
        fileStore.jobStore.readFile(jobStoreID, f)
      File "/home/users/xyxu/pantools/cactus-bin-v2.5.2/cactus_env/lib/python3.9/site-packages/toil/lib/compatibility.py", line 12, in call
        return func(*args, **kwargs)
      File "/home/users/xyxu/pantools/cactus-bin-v2.5.2/cactus_env/lib/python3.9/site-packages/toil/jobStores/abstractJobStore.py", line 1273, in readFile
        return self.read_file(jobStoreFileID, localFilePath, symlink)
      File "/home/users/xyxu/pantools/cactus-bin-v2.5.2/cactus_env/lib/python3.9/site-packages/toil/jobStores/fileJobStore.py", line 543, in read_file
        os.link(jobStoreFilePath, local_path)
    PermissionError: [Errno 1] Operation not permitted: '/home/users/xyxu/panpipeline/pansplit/js_split/files/for-job/kind-LastzRepeatMaskJob/instance-8i3_p7_9/file-9180904b15d8423abafc7bebf4236a86/HG02055.1_0.maskedQeury' -> '/home/users/xyxu/tmp/d397bf4f9f945b43a0ee6743fa184d9b/17a4/a59d/tmpzi9jdxca.tmp'
    [2023-06-04T17:52:56+0800] [MainThread] [E] [toil.worker] Exiting the worker because of a failed job on host node13

<=========

glennhickey commented 1 year ago

Goodness, that's an error happening in Toil's file localization. I'm worried that something may have become corrupt when you first ran out of disk space, and that your job is in an un-resumable state.

As a general comment, the dna-brnn masking pipeline was used to make the released graph, but it is deprecated at this point. If you want to exactly reproduce that graph, you should use the commits referred to in the papers. If you want to make a new graph from the same data, you'd be best served by just running cactus-pangenome.
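A minimal invocation would look something like the sketch below (the job store, seqfile, output name, and reference genome name are placeholders, and the exact options may differ between versions; see cactus-pangenome --help):

    # Build the pangenome end-to-end; --reference names the backbone genome,
    # and --gbz/--gfa request those output formats in addition to the HAL.
    cactus-pangenome ./js ./grch38_pan_new.seqfile \
        --outDir ./pangenome-out --outName grch38_pan \
        --reference GRCh38 --gbz --gfa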

xuxingyubio commented 1 year ago

I looked at the paths in my graph by using vg paths -Lx. The output follows the format SAMPLE#HAPLOTYPE#CONTIG#0, for example:

GRCh38#0#chr6
GRCh38#0#chr1
HG002#1#JAHKSE010000066.1#0
...

When I renamed the contigs in the annotation produced by CAT to match the above format, it still reported an error like this: ERROR: Chromomsome path "HG002#1#JAHKSE010000066.1#0" not found in graph or haplotypes index (line 2).

How can I solve this problem?

glennhickey commented 1 year ago

CAT runs on the .hal file, not the .gbz, and the path names may appear differently in the two formats. Use halStats and halStats --sequenceStats to find the names in your hal file.
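For example (the .hal path and genome name below are placeholders for your own):

    # List the genomes in the HAL file, with sequence counts and total lengths.
    halStats grch38_pan.hal

    # List the individual sequence names and lengths for one genome.
    halStats --sequenceStats HG002.1 grch38_pan.hal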